Article

Decoding Lung Cancer Radiogenomics: A Custom Clustering/Classification Methodology to Simultaneously Identify Important Imaging Features and Relevant Genes

1 School of Engineering and Applied Science, George Washington University, Washington, DC 20052, USA
2 Department of Radiology, School of Medicine and Health Sciences, George Washington University, Washington, DC 20052, USA
3 Department of Radiation Oncology, School of Medicine and Health Sciences, George Washington University, Washington, DC 20052, USA
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(7), 4053; https://doi.org/10.3390/app15074053
Submission received: 26 February 2025 / Revised: 27 March 2025 / Accepted: 2 April 2025 / Published: 7 April 2025

Featured Application

This paper presents a custom combined clustering and image classification methodology that groups genetic mutation patterns representative of lung cancer using CT data. The methodology could be applied to other radiogenomic or image classification problems with partially or completely unlabeled images to cluster final results.

Abstract

Background: This study evaluated a custom algorithm that performs a radiogenomic analysis on lung cancer genetic and imaging data, specifically by using machine learning to test whether a custom clustering/classification method could simultaneously identify features from imaging data that correspond to genetic markers. Methods: CT imaging data and genetic mutation data for 281 subjects with NSCLC were collected from the CPTAC-LUAD and TCGA-LUSC databases on TCIA. The algorithm was run as follows: (1) genetic clusters were initialized using random clusters, binary matrix factorization, or k-means; (2) image classification was run on CT data for these genetic clusters; (3) misclassified subjects were re-classified based on the image classification algorithm; and (4) the algorithm was run until an accuracy of 90% was reached or no improvement was seen after 10 runs. Input genetic mutations were evaluated for potential medical treatments and severity to provide clinical relevance. Results: The image classification algorithm achieved >90% accuracy after nine algorithm runs and grouped subjects from a starting five clusters into four final clusters, where the final image classification accuracy was better than every initial clustered accuracy. These clusters were stable across all three test runs. A total of thirty-eight genes from the top hundred across each subject were identified with specific severity or treatment data; twelve of these genes are listed. Conclusion: This small pilot study presented a potential way to identify genetic patterns from image data and presented a methodology that could group images with no labels or only partial labels for future problems.

1. Introduction

Cancer remains a leading cause of death worldwide, with nearly 10 million deaths and 19.3 million new cases in 2020 [1,2]. Of the types of cancers, lung cancer is consistently the deadliest, with nearly 20% of these deaths (1.9 million) in 2020 attributable to it alone [1,2]. Identifying cancer early is paramount. When cancers are detected early, the rate of survival increases dramatically. However, studies have found that 50% of cancers detected are already at an advanced stage [3]. Although the prevention of cancer has long been a common strategy for mitigation, studies have found that it can lead to over-diagnosis, and many cancers still lack any preventative strategies at all [4]. This has made the study of cancer diagnosis and treatment of strong interest.
Despite many advances in clinical medicine and research over the past few decades, the diagnosis of cancer has remained a difficult problem [5]. Current cancer diagnostic strategies often rely heavily on imaging. Techniques and screening through methodologies such as positron emission tomography (PET), computed tomography (CT), magnetic resonance imaging (MRI), and molecular diagnostic techniques have strongly contributed to improved rates of early detection and staging [6]. Lung cancer is typically staged using the tumor, node, and metastasis (TNM-8) staging system through CT scans or a combination of PET/CT to stratify the risk and severity of the tumor [7]. As there are currently no validated biomarkers for lung cancer that predict which patients need treatments such as radiotherapies, there has been growing interest in radiomics, which looks to find quantifiable imaging features that could lead to a greater stratification of the risk and severity of lung cancer tumors [8]. The use of deep learning in combination with traditional radiomics analysis has also grown in popularity due to its potential to diagnose and further stratify the risk of cancers such as lung cancer [9,10].
The rise of genetic testing and precision medicine has brought with it great potential to unlock more methods of cancer diagnosis and treatment. After the first human genome was sequenced in 2005, various studies have attempted to identify the specific genomic markers of cancer and, especially, lung cancer [11,12,13,14,15,16]. Radiogenomics, or the combination of identifying these genetic factors on traditional imaging techniques, has become a popular potential new methodology and target of cancer research. Radiogenomics offers a noninvasive, faster, genetic screening panel for cancer and attempts to identify these predictive genetic biomarkers from cancer imaging [17,18,19]. The radiogenomics workflow typically consists of the collection of a large group of imaging features such as contrast, intensity, heterozygosity, dynamic features, or software-identified markers from a medical image [20]. Next, these features need to be filtered to a manageable set for potential analysis through deep learning or traditional statistics. Pilot studies have evaluated the use of clustering for the radiogenomic analysis of lung cancer using groups of 100–500 patients to attempt to group these features and simplify the total number of model inputs or to separate the outcome variable to simplify analysis [21,22,23]. Current approaches often are demonstrated and validated through simple statistic tests such as t-tests with corresponding p-values to analyze input variables selected by models and the outcomes [24]. This deep learning based framework has so far demonstrated success for predicting treatment outcomes for cancers such as lung cancer [25,26]. But this methodology is not without limitations. 
The use of deep learning also comes with an “explainability problem” where it is not clear what the model is selecting, separate model runs can lead to entirely different and non-reproducible results, and it is not clear if these single-gene or single-genetic-pattern analyses conducted by studies are even targeting the correct genetic variables actually predictive of cancer [27,28]. A study of 25,000 patients found that there was no single genomic indicator able to predict cancer’s likelihood of metastasis [29]. This study hypothesized that one way to overcome all of these variables was to investigate clustering and deep learning classification on radiogenomic data together to simultaneously identify relevant genetic “clusters” or patterns in addition to imaging features that identify severe cancer metastases.
This pilot study created and evaluated a custom methodology to identify patterns of radiomic features from genetics and imaging data in combination that, together, can better diagnose cancers such as lung cancer and predict cancer severity. Specifically, this study created a custom clustering algorithm to evaluate whether a machine learning algorithm using imaging data from lung CT scans could identify and cluster relevant genetic markers. This methodology (1) clustered initial genetic data, (2) re-clustered the data based on an image classification model accuracy, and (3) re-assigned misclassified patients to new clusters. By using current clustering methodologies and deep learning image analysis methods, this study sought to capitalize on the automatic feature selection and flexibility of deep clustering methods for genetics while also optimizing accuracy through a convolutional front end as is typical of deep learning and machine learning image classification methods. This study evaluated this methodology on a dataset consisting of genetic and CT imaging data for lung cancer to specifically target a type of cancer that currently lacks biomarkers and could truly benefit from genetic or imaging biomarkers. The final outcome of this process was an algorithm that predicts which patients are likely to fall into which genetic patterns based on their imaging data. This study was particularly novel as it not only presented a new methodology that could be applied elsewhere but specifically used this methodology to also identify genetic patterns indicative of lung cancer severity. This study also evaluated genes used in this methodology to test for severity linkage and potential associated treatments. Although this methodology could easily be applied elsewhere to other medical imaging and non-medical imaging tasks, by including these severity and treatment data in clustering generation, this study hoped to achieve an algorithm that also had clinical viability.

2. Materials and Methods

2.1. Overview

Current radiogenomics workflows gather genetic data and medical imaging data separately before comparing them through feature extraction, methods such as clustering, and standard statistical tests or machine learning (Figure 1).
For this study, a custom machine learning algorithm was developed and used to classify tumors while identifying genetic prognostic indicators of downstream severity for lung cancer. This architecture functioned using a CNN for classification and a clustering algorithm that generated the proposed outcome variable with each iteration (Figure 2).

2.2. Data Collection

Data were collected from The Cancer Imaging Archive (TCIA) and The Cancer Genome Atlas (TCGA) databases [30]. Computed tomography (CT) and genetic data were collected for lung cancer subjects. Genetic data were grouped into relevant important mutations based on analysis from TCGA. Thus, this study used mutation lists that could be matched back to the mutated gene and read by a human interpreter instead of the raw string of genetic data. This study specifically gathered data from TCGA-LUAD: The Cancer Genome Atlas Lung Adenocarcinoma Collection and TCGA-LUSC: The Cancer Genome Atlas Lung Squamous Cell Carcinoma Collection [31,32]. Further details about these datasets are listed in Table 1.

2.3. Data Preprocessing

TCGA mutation analysis was used to identify relevant genetic mutations from the data present amongst the patient populations. The TCGA describes this process as follows: data are first collected from both the primary tumor and matched normal tissue samples; the samples then undergo pathological review and nucleic acid extraction (DNA and RNA); sequencing is carried out through next-generation sequencing (NGS), microarray, or Sanger sequencing, followed by variant calling and annotation; and the final analysis is conducted through pan-cancer, gene-level, pathway, or survival evaluation [33,34]. This project used the mutation analysis from the primary tumor sequenced via NGS for DNA at gene-level analysis. Mutations were ranked by total patients afflicted across the two datasets (CPTAC-LUAD and TCGA-LUSC). Next, the top 100 mutations for each group were selected as potential candidates to be fed into the custom clustering/classification algorithm. These top 100 mutations were evaluated for linkage to cancer severity or potential targeted genetic treatments. A final set of 38 genes was identified that were present across most patients, had severity linkage, or had associated specific treatments. Genes were grouped for each patient into a binary matrix such that the presence of a mutation was indicated by “1” and the lack of a mutation for that patient was indicated by “0”. CT image data for the lung cancer patients were collected from the TCIA website for the above patients. Single-slice CT images for each patient with an identified lung tumor were selected for input into the image processing model. Single-slice CT tumor images were selected based on the presence of tumor and TCIA criteria. When available, single-slice selections and annotations were validated through the IDC Zenodo automatic segmentations as provided on the TCIA website for TCGA LUAD and LUSC [35,36]. However, radiologist annotations were considered paramount when available.
If a patient had multiple images with tumors, each was treated as an individual image. These images, however, were not split between the training and testing sets: if a patient had multiple lung CT slices, it was ensured that all of that patient's slices remained in the same cohort (training or testing). This was carried out so that patient data did not appear across both cohorts, to reduce the risk of bias and overfitting, and specifically to keep the model predicting patient data and not imaging data. This came to 22,785 images across 281 patients and the 2 datasets.
Imaging data were preprocessed according to the TensorFlow default load and pipeline through the keras.utils package. This involved loading the dataset into a TensorFlow object and default TensorFlow data augmentation through rotation and flipping of the images. Imaging features were selected by the convolutional layers present within the CNN model. This involved sliding a kernel along each image to select potential regions for analysis, passing these potential features through a downsampling layer, and repeatedly passing the features through these layers to produce a final set of weighted feature maps. These maps were sent to a multi-class cross-entropy and predictive function to produce the ranked score the model used to predict whether an image was of a certain class or not. The advantage of using a CNN was the automatic feature selection; however, as these maps sought to maximize the model predictions and not the needs of the clinician, there could often be thousands of these feature maps that can seem abstract and difficult to interpret, so analysis had to be performed with care. This project did not seek to analyze the specific feature maps present within the imaging model as a portion of this pilot study. For further analysis, however, it would be interesting to focus on whether the highest-weighted feature maps contained lung cancer, and what regions of the lung were contained within the thousands of feature maps created from the CNN.
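The patient-level partitioning described above can be illustrated with a minimal sketch. It is not the authors' code: the function name `split_by_patient`, the patient identifiers, and the toy data are hypothetical, and the 20% test fraction is taken from the split reported later in the paper.

```python
import random
from collections import defaultdict


def split_by_patient(image_patient_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that no patient spans both cohorts."""
    # Group image indices under the patient they belong to.
    by_patient = defaultdict(list)
    for idx, pid in enumerate(image_patient_ids):
        by_patient[pid].append(idx)

    # Shuffle patients (not images) and hold out a fraction of them.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train_idx, test_idx = [], []
    for pid, indices in by_patient.items():
        (test_idx if pid in test_patients else train_idx).extend(indices)
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy data: 10 patients with 3 CT slices each.
ids = [f"P{p:02d}" for p in range(10) for _ in range(3)]
train_idx, test_idx = split_by_patient(ids)

# Patients that appear in both cohorts (should be empty).
overlap = {ids[i] for i in train_idx} & {ids[i] for i in test_idx}
```

Splitting on patients rather than on images is what keeps slices from one subject out of both cohorts; an image-level shuffle would not give that guarantee.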
This project utilized two datasets from the TCGA and TCIA websites to attempt to tackle the generalizability issue often present within machine learning studies. Future work on even more datasets is recommended to encourage generalizability.

2.4. Algorithm Overview

Clustering Creation and Customization:
(1)
Initial clusters were generated for genetic groupings using a few test methods:
  • Binary Matrix Factorization: Binary matrix factorization was used to generate initial clusters, which generated a large m × n matrix of multiple genes as columns per subject where genetic activation was displayed as 0/1 for ‘not activated’ or ‘activated’.
  • Random Assignment: Initial clusters were also randomly initialized such that initial clusters of genetic groups would have equal distributions of patients in each.
  • K-Means: Initial clusters were also grouped using a k-means clustering process across the 100 columns to generate a set of 5 initial clusters.
Genetic data for the patients in the 0/1 matrix were grouped into these genetic groups using the aforementioned methods. This study used 5 initial groupings to simplify the analysis. However, the number of genes and total clusters were easily scalable based on user preference.
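As an illustration of step (1), the sketch below builds a 0/1 mutation matrix and produces a balanced random initial assignment. It is a simplified example under stated assumptions: the function names, patient identifiers, and mutation sets are hypothetical, only a handful of genes named in this paper are used, and the binary matrix factorization and k-means initializations are not shown.

```python
import random


def build_mutation_matrix(patient_mutations, genes):
    """Rows are patients, columns are genes; 1 = mutated, 0 = not mutated."""
    patients = sorted(patient_mutations)
    matrix = [
        [1 if g in patient_mutations[p] else 0 for g in genes]
        for p in patients
    ]
    return patients, matrix


def balanced_random_clusters(patients, n_clusters=5, seed=0):
    """Randomly assign patients so that cluster sizes differ by at most one."""
    shuffled = list(patients)
    random.Random(seed).shuffle(shuffled)
    return {p: i % n_clusters for i, p in enumerate(shuffled)}


# Hypothetical toy input: which genes are mutated in which patient.
genes = ["KRAS", "TP53", "EGFR", "ALK", "BRAF"]
patient_mutations = {
    "P01": {"KRAS"},
    "P02": {"TP53", "EGFR"},
    "P03": {"ALK"},
    "P04": {"TP53", "BRAF"},
    "P05": set(),
    "P06": {"KRAS", "TP53"},
}

patients, matrix = build_mutation_matrix(patient_mutations, genes)
clusters = balanced_random_clusters(patients, n_clusters=3)
```

The cluster identifiers produced here would then serve as the class labels for the image classifier in step (2).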
(2)
Next, a deep learning image classification algorithm was run on the medical imaging data for each subject, with the generated genetic clusters as the outcome. Standard accuracy, AUC, and loss were generated to assess model performance. The loss metric utilized a proposed deep clustering metric (ratio of training loss to testing loss). This proposed algorithm used single-slice CT data for each patient, where each CT slice was confirmed to have lung tumor present.
Overview: This study used a standard Convolutional Neural Network process to build the deep learning algorithm to classify inputs:
  • Kernel Normalization: A standard CNN kernel with gradient calculation and a sliding technique was used to identify initial features. DICOM data are unique in that the values of the Hounsfield units from CT scans or MRI intensity values often vary from pixel to pixel. Standard feature selection and edge detection techniques often evaluate the gradient between pixels. To ensure that small fluctuations in DICOM pixel data did not generate false-positive features based on perceived large changes in the gradient between neighboring pixels, the pixels were first averaged for each receptive field according to the kernel. Gaussian blur was avoided so as not to lose information, and edge detection was avoided on the raw data so as not to generate false edges or messy images.
  • Pooling and Final Fully Connected Layers: Each of these shapes was then fed into a pooling layer. A standard nonlinear function (ReLU) was applied.
  • Activation Function: These feature maps were passed through a final layer to determine class scores. A final activation function such as a Softmax function was used to normalize the input into a probability distribution based on Luce’s choice axiom.
  • Output: Thus, the final output of a model, similar to other model outputs, was a ranked probability score and corresponding classification that identified whether a series of shapes identified from a DICOM image corresponded to one class or another based on the presence or absence of similar shapes in other images of that class.
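Two of the operations in the list above, averaging pixels over each receptive field and normalizing class scores with a Softmax function, can be sketched in a few lines. This is an illustrative example only: the 3 × 3 window size, the handling of image borders, and the toy values are assumptions not stated in the text.

```python
import math


def mean_filter(image, k=3):
    """Average each pixel over a k x k receptive field.

    At the image border the window is truncated to the pixels that exist
    (an assumption; the text does not specify border handling).
    """
    h, w = len(image), len(image[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                image[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            ]
            out[y][x] = sum(window) / len(window)
    return out


def softmax(scores):
    """Normalize raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical patch with one isolated bright pixel.
patch = [[0, 0, 0], [0, 90, 0], [0, 0, 0]]
smoothed = mean_filter(patch)

# Hypothetical raw scores for three clusters.
probs = softmax([2.0, 1.0, 0.1])
```

In the toy patch, the isolated spike of 90 is reduced to 10 at the center after averaging, which shows how a single noisy pixel stops producing a large gradient against its neighbors.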
For this experiment, a custom classification method and the standard preset model layer groupings available within the TensorFlow applications package were tested to determine the optimal accuracy for the model. The Residual Neural Network (ResNet) framework was selected as the model of choice for the classification layer due to its ready availability, popularity in the literature, ability to perform well on small datasets, and speed. Models were trained in three ways: (a) “from scratch”, wherein every layer was retrained; (b) through a “transfer learning” process, where weights from a separate dataset were used to initialize the model and only the final layer was retrained on the lung image data; and (c) through a hybrid process, where transfer learning was used to train the first layers and 5 additional dense and dropout layers were added at the end to also leverage the “from scratch” aspect of customization on the data. The final ResNet model reported here was a hybrid model in which initial features and weights from the ImageNet dataset were followed by an additional 5 dense and dropout layers before the classification function that output a ranked probability score. Default TensorFlow parameters were used for this model: learning rate lr = 2 × 10^−5, batch size 32, 25 training epochs, 10 classification/clustering epochs (or until the threshold of 0.90 accuracy was reached, which was what 10 runs worked out to be for this pilot study), and an 80/20 training/testing split such that standard 5-fold cross-validation could be used.
(3)
The loss, accuracy, and AUC were then used to generate new clusters and move groupings to retrain the model. To do this, prediction results for each patient were generated according to the trained image classification model. If a patient was misclassified, they were moved into a new cluster. The image classification model was then re-run on these new clusters to identify new loss, accuracy, and AUC metrics. This process was repeated until the model could achieve at least 90% accuracy on the data.
To prevent perpetual retraining of the algorithm, a threshold of 10 runs was set such that if the algorithm did not improve after 10 iterations of re-clustering, the process would also finish. The threshold and target accuracy could be defined by the user and were thus completely customizable. Model accuracy for the first few runs on the different types of initialized clusters was reported in addition to the final model accuracy.
Final Model: The final model framework consisted of this process of (1) initializing genetic clusters, (2) training an image classification algorithm, and (3) then adjusting the clusters based on the results of the image classification results. In this method, the clustering algorithm and image classification algorithm both sought to improve each other. Validation metrics sought to optimize accuracy of the predicted clusters themselves and to optimize the separation of patients into groupings.
The final model is summarized in Figure 3.
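The three-step loop can be sketched as follows. This is a toy illustration under loud assumptions, not the authors' implementation: a nearest-centroid rule stands in for the hybrid ResNet classifier, the two-dimensional feature vectors stand in for CT images, and accuracy is scored on the same subjects used for fitting instead of on a held-out set. The names `fit_centroids`, `predict`, and `recluster` are hypothetical. With this stand-in the loop behaves like a k-means-style update; with a CNN the structure of the loop would be the same but the classifier would be retrained on images at each run.

```python
def fit_centroids(features, labels):
    """Stand-in 'classifier': one mean feature vector per non-empty cluster."""
    sums, counts = {}, {}
    for x, c in zip(features, labels):
        acc = sums.setdefault(c, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[c] = counts.get(c, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def predict(centroids, x):
    """Return the cluster whose centroid is closest to x."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], x)),
    )


def recluster(features, labels, threshold=0.90, patience=10):
    """Train, score, and move misclassified subjects until the accuracy
    threshold is met or `patience` runs pass without improvement."""
    labels = list(labels)
    history, best, stale = [], 0.0, 0
    while stale < patience:
        centroids = fit_centroids(features, labels)
        predicted = [predict(centroids, x) for x in features]
        accuracy = sum(p == c for p, c in zip(predicted, labels)) / len(labels)
        history.append(accuracy)
        if accuracy >= threshold:
            break
        stale = 0 if accuracy > best else stale + 1
        best = max(best, accuracy)
        # Misclassified subjects take the cluster the classifier predicted.
        labels = predicted
    return labels, history


# Hypothetical toy features: two tight groups, three initial clusters.
features = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
initial = [0, 0, 2, 1, 1, 2]
final, history = recluster(features, initial)
```

In this toy run the third initial cluster empties out after one round of re-assignment, which mirrors in miniature how the number of clusters can shrink during re-clustering.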

2.5. Outcomes

The outcomes that this model would be targeted to predict would aim to accomplish two goals: (1) to create clusters of genetic profiles and (2) to use both the images and these clusters to identify markers of more severe cancers or cancers that had associated treatments. This study was limited to outcomes of non-small-cell lung cancers (NSCLCs) as those were the cancers present within the chosen datasets. NSCLC cancer types of Adenocarcinoma and Squamous Cell Carcinoma were the two main cancers targeted by this study.

2.6. Statistical Analysis

For this model, after 10 runs, the threshold of 0.90 was validated through 5-fold cross-validation and a shuffle test in which the clustering labels were shuffled to see whether any subsequent run could achieve the same accuracy of 0.90. The final model was considered statistically significant (p < 0.05; the shuffle test achieved no runs at 0.90).
To account for any potential biases arising from differences between the two datasets, additional analysis of the demographic and staging information present within the clinical data was conducted. Average age, race, gender, and stage were compared.
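A shuffle (permutation) test of this kind can be sketched as below. The sketch is an assumption-laden illustration: `score_fn` is a hypothetical callable that returns the accuracy obtained with a given labeling (in the study this would involve rerunning the model on shuffled cluster labels), the number of shuffles is arbitrary, and the add-one correction in the p-value estimate is a common convention that the paper does not mention.

```python
import random


def shuffle_test(score_fn, labels, observed, n_shuffles=99, seed=0):
    """Estimate a p-value as the share of label shuffles scoring >= observed."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_shuffles):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        if score_fn(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


def agreement_with(predicted):
    """Build a toy scoring function: fraction of labels matching `predicted`."""
    def score(labels):
        return sum(p == c for p, c in zip(predicted, labels)) / len(labels)
    return score


# Hypothetical toy case: 40 subjects in two clusters, perfectly predicted.
true_labels = [0] * 20 + [1] * 20
score_fn = agreement_with(true_labels)
p_value = shuffle_test(score_fn, true_labels, observed=0.90)
```

If no shuffled labeling reaches the observed accuracy, the estimate is 1/(n_shuffles + 1), so at least 19 shuffles are needed before the result can fall below 0.05.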

3. Results

3.1. Initial Genetic Clusters and Cancer Severity Data

This experiment used data from the TCIA and TCGA LUSC and LUAD collections, which consisted of genetic and imaging data across the associated patients. This corresponded to the following: TCGA-LUSC, with 74 studies, 37 participants, 36,518 images, and 504 sequencing reads; and CPTAC-LUAD, with 152 studies, 69 participants, 48,931 images, and 582 sequencing reads. There were 22,785 total CT lung scans with contrast.
There were 38 genes ultimately identified as containing relevant genetic data linked to lung cancer severity and/or treatment. The top 12 genes present across a large number of patients and imaging scans are reported below in Table 2.

3.2. Model Results

The Hybrid-ResNet model achieved initial accuracy values of 7% on the five randomly initialized clusters, 70% on the binary matrix factorization clusters, and 72% on the five k-means-initialized clusters.
The model was then run until the 90% threshold was reached and achieved 92.3% accuracy in 9 cycles. All 3 initialization methods reached 90% accuracy using this re-clustering method within the preset limit.
The initial 5 clusters became 4 clusters after this process. The breakdown of genetic groupings and total patients is summarized in Figure 4.
To account for any potential biases in the data itself, the clinical data were also analyzed for differences between the two datasets (Table 3). Overall, these results appear to show that the two datasets consisted of relatively similar patient populations, with one having disproportionately more men than the other.

4. Discussion

This study detailed a custom genetic clustering and image classification algorithm and then presented the results of running it on a small set of lung CT and genetic data. This algorithm was fast and efficient, and successfully grouped genetic clusters, and it was able to link the genetic patterns to the imaging data. This methodology is especially interesting in that it could take a traditional supervised machine learning/deep learning problem (image classification) and leverage a clustering methodology to make the final labels also function in an unsupervised way.
Although no prior algorithm had attempted to perform this same process on medical imaging and cancer data for lung cancer in this way, many similar algorithms exist in the literature. Zhao et al. fed 10 genes associated with genetic mutations for lung cancer into a Fuzzy C-Means algorithm and generated five groupings based on image input before reporting the Silhouette Coefficient (SC) (0.68 for their study) and Davies–Bouldin Index (DBI) (2.35 for their study) [37]. Li et al. used a combined clustering/generative image strategy by segmenting lung tumors using a U-Net, using a CNN with genetic and imaging input to predict genes from imaging data, and then using a Generative Adversarial Network to generate new tumor data [38]. Buda et al. used a CNN and transfer learning combined with a pre-identified set of genomic clusters to predict said clusters using MRI data (AUC 0.6–0.7) [39]. Hoivik et al. used a U-Net to segment images, two radiologist-defined imaging features, and k-means to generate two clusters from 11 genomic signatures to create a cluster where >50% of patients in one cluster had high risks of cancer [40]. These studies used a small number of pre-selected genes (10) and frequently generated as few as five clusters. Some of the studies (Zhao et al.) [37] that were most similar to this algorithm reported worse accuracy than our study (0.9 > 0.7 AUC). It is important to note that the algorithm presented here was not the same as those in these studies, and the other like-minded studies did not report comparable metrics such as the AUC or accuracy. It would be interesting in future work to see how the results presented here compare to others, should these other studies also report the accuracy and AUC for any classification methods used. These past studies also did not take into account the clinical relevance of the genetic clusters or whether they had any linkage to cancer severity.
One major advantage of the algorithm and methods presented here is that they consider clinical relevance and do not require the manual selection of genes or imaging input. Large groupings of genes can be retrospectively analyzed. By iteratively re-clustering using the accuracy of an image classification model, this algorithm was also replicable across three different initialization strategies (random, binary matrix factorization, and k-means), and despite differing initial accuracies of the image classification algorithm, got to a similar classification accuracy (>90%) after only a short number of runs (nine).
This study was limited in that it used a small dataset and only tested a small number of possible outputs. This framework could be applied to a large number of additional problems and would benefit from the inclusion of more data. One could take a large dataset of unlabeled or partially unlabeled images that are similar, cluster them initially, and re-cluster them to identify final groupings of said images. Additionally, this study limited the initial gene clustering to 100 of the top genes. This could be further tested using more genes, potentially even thousands, to test whether clusters come out stable and of high accuracy. This study also limited its scope to one type of cancer (lung) and one specific sub-type of cancer, non-small-cell lung cancer (NSCLC). However, as this was a pilot study on a small number of samples, this could easily be expanded. One thing to note is that, as this study was able to consistently re-classify clusters into similar groupings based on accuracy, it is possible that if it were to run indefinitely, it could re-cluster everything into an artificially small number of clusters with 100% accuracy and overfit the data. Thus, it is important to apply some sort of threshold or patience to prevent this, as is the case with many machine learning models. For this study, we attempted to address potential overfitting through the use of multiple datasets, by stopping after a relatively low threshold and minimal number of runs (traditionally called epochs), and by not letting the algorithm run to completion. It would be interesting to explore, in future work, at what point an optimal threshold is reached on average for a number of runs, and whether there is some optimal number of runs or whether all machine learning models must vary depending on the data itself.
The genetic clusters provided interesting insight into potential patterns of cancer inheritance as well. KRAS can currently be treated with sotorasib (Lumakras) and adagrasib (Krazati) and was grouped into its own cluster [41,42,43]. Similarly, ALK, CDKN2, KDR, and NTRK fell into a cluster, as did TP53, EGFR, CDKN2, and BRAF. It would be interesting to see whether these clusters would hold across a larger number of data samples or whether this was specific to the populations of NSCLC subjects present in these data. This study did not attempt to make grandiose claims or thoroughly evaluate these clusters, as the final goal was to present a pilot study on a small number of samples.

5. Conclusions

This study presented a custom combined genetic clustering and image classification algorithm that took 100 initial genetic inputs and grouped them into four distinct genetic clusters within nine algorithm runs based on image classification accuracy from a hybrid ResNet model (>90%). The algorithm remained stable across three initialization strategies (random, binary matrix factorization, and k-means). After re-classifying subjects according to model accuracy, it achieved a greater accuracy than the initial clusters across all runs. This presents a combined unsupervised–supervised algorithm that could be applied to future contexts.

Author Contributions

Conceptualization, D.P., J.P.L., S.G. and Y.J.R.; methodology, D.P.; software, D.P.; validation, D.P., J.P.L., S.G. and Y.J.R.; formal analysis, D.P.; investigation, D.P., J.P.L., S.G. and Y.J.R.; resources, D.P., J.P.L., S.G. and Y.J.R.; data curation, D.P.; writing—original draft preparation, D.P.; writing—review and editing, D.P., J.P.L., S.G. and Y.J.R.; visualization, D.P.; supervision, D.P., J.P.L., S.G. and Y.J.R.; project administration, D.P., J.P.L., S.G. and Y.J.R.; funding acquisition, D.P., J.P.L., S.G. and Y.J.R.; All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded partially by a grant from the American Cancer Society.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Acknowledgments

The results shown here are, in whole or part, based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga (accessed on 25 February 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kratzer, T.B.; Bandi, P.; Freedman, N.D.; Smith, R.A.; Travis, W.D.; Jemal, A.; Siegel, R.L. Lung cancer statistics, 2023. Cancer 2024, 130, 1330–1348. [Google Scholar] [CrossRef] [PubMed]
  2. Sung, H.; Ferlay, J.; Siegel, R.L.; Laversanne, M.; Soerjomataram, I.; Jemal, A.; Bray, F. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J. Clin. 2021, 71, 209–249. [Google Scholar] [CrossRef] [PubMed]
  3. Crosby, D.; Bhatia, S.; Brindle, K.M.; Coussens, L.M.; Dive, C.; Emberton, M.; Esener, S.; Fitzgerald, R.C.; Gambhir, S.S.; Kuhn, P.; et al. Early detection of cancer. Science 2022, 375, eaay9040. [Google Scholar] [CrossRef] [PubMed]
  4. Loomans-Kropp, H.A.; Umar, A. Cancer prevention and screening: The next step in the era of precision medicine. NPJ Precis. Oncol. 2019, 3, 3. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  5. Hamilton, W. Cancer diagnosis in primary care. Br. J. Gen. Pr. 2010, 60, 121–128. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  6. Pulumati, A.; Pulumati, A.; Dwarakanath, B.S.; Verma, A.; Papineni, R.V.L. Technological advancements in cancer diagnostics: Improvements and limitations. Cancer Rep. 2023, 6, e1764. [Google Scholar] [CrossRef]
  7. Archer, J.M.; Truong, M.T.; Shroff, G.S.; Godoy, M.C.B.; Marom, E.M. Imaging of Lung Cancer Staging. Semin. Respir. Crit. Care Med. 2022, 43, 862–873. [Google Scholar] [CrossRef]
  8. Walls, G.M.; Osman, S.O.S.; Brown, K.H.; Butterworth, K.T.; Hanna, G.G.; Hounsell, A.R.; McGarry, C.K.; Leijenaar, R.T.H.; Lambin, P.; Cole, A.J.; et al. Radiomics for Predicting Lung Cancer Outcomes Following Radiotherapy: A Systematic Review. Clin. Oncol. 2022, 34, e107–e122. [Google Scholar] [CrossRef]
  9. Avanzo, M.; Stancanello, J.; Pirrone, G.; Sartor, G. Radiomics and deep learning in lung cancer. Strahlenther. Onkol. 2020, 196, 879–887. [Google Scholar] [CrossRef]
  10. Lander, E.S.; Linton, L.M.; Birren, B.; Nusbaum, C.; Zody, M.C.; Baldwin, J.; Devon, K.; Dewar, K.; Doyle, M.; FitzHugh, W.; et al. Initial sequencing and analysis of the human genome. Nature 2001, 409, 860–921. [Google Scholar]
  11. Koh, D.-M.; Papanikolaou, N.; Bick, U.; Illing, R.; Kahn, C.E.; Kalpathi-Cramer, J.; Matos, C.; Martí-Bonmatí, L.; Miles, A.; Mun, S.K.; et al. Artificial intelligence and machine learning in cancer imaging. Commun. Med. 2022, 2, 133. [Google Scholar] [CrossRef]
  12. Cerami, E.; Gao, J.; Dogrusoz, U.; Gross, B.E.; Sumer, S.O.; Aksoy, B.A.; Jacobsen, A.; Byrne, C.J.; Heuer, M.L.; Larsson, E.; et al. The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2012, 2, 401–404. [Google Scholar]
  13. Rubio-Perez, C.; Tamborero, D.; Schroeder, M.P.; Antolín, A.A.; Deu-Pons, J.; Perez-Llamas, C.; Mestres, J.; Gonzalez-Perez, A.; Lopez-Bigas, N. In silico prescription of anticancer drugs to cohorts of 28 tumor types reveals targeting opportunities. Cancer Cell 2015, 27, 382–396. [Google Scholar]
  14. Borczuk, A.C.; Toonkel, R.L.; Powell, C.A. Genomics of lung cancer. Proc. Am. Thorac. Soc. 2009, 6, 152–158. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  15. Zhang, T.; Joubert, P.; Ansari-Pour, N.; Zhao, W.; Hoang, P.H.; Lokanga, R.; Moye, A.L.; Rosenbaum, J.; Gonzalez-Perez, A.; Martínez-Jiménez, F.; et al. Genomic and evolutionary classification of lung cancer in never smokers. Nat. Genet. 2021, 53, 1348–1359. [Google Scholar] [CrossRef] [PubMed]
  16. Restrepo, J.C.; Dueñas, D.; Corredor, Z.; Liscano, Y. Advances in Genomic Data and Biomarkers: Revolutionizing NSCLC Diagnosis and Treatment. Cancers 2023, 15, 3474. [Google Scholar] [CrossRef]
  17. Jansen, R.W.; Van Amstel, P.; Martens, R.M.; Kooi, I.E.; Wesseling, P.; De Langen, A.J.; Menke-Van der Houven van Oordt, C.W.; Jansen, B.H.E.; Moll, A.C.; Dorsman, J.C.; et al. Non-invasive tumor genotyping using radiogenomic biomarkers, a systematic review and oncology-wide pathway analysis. Oncotarget 2018, 9, 20134–20155. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  18. Rosenstein, B.S.; West, C.M.; Bentzen, S.M.; Alsner, J.; Andreassen, C.N.; Azria, D.; Barnett, G.C.; Baumann, M.; Burnet, N.; Chang-Claude, J.; et al. Radiogenomics: Radiobiology enters the era of big data and team science. Int. J. Radiat. Oncol. Biol. Phys. 2014, 89, 709–713. [Google Scholar] [CrossRef]
  19. Nie, K.; Al-Hallaq, H.; Li, X.A.; Benedict, S.H.; Sohn, J.W.; Moran, J.M. NCTN Assessment on Current Applications of Radiomics in Oncology. Int. J. Radiat. Oncol. Biol. Phys. 2019, 104, 302–315. [Google Scholar] [CrossRef] [PubMed]
  20. Liu, Z.; Duan, T.; Zhang, Y.; Weng, S.; Xu, H.; Ren, Y.; Zhang, Z.; Han, X. Radiogenomics: A key component of precision cancer medicine. Br. J. Cancer 2023, 129, 741–753. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  21. Xu, Y.; Hosny, A.; Zeleznik, R.; Parmar, C.; Coroller, T.; Franco, I.; Mak, R.H.; Aerts, H.J. Deep Learning Predicts Lung Cancer Treatment Response from Serial Medical Imaging. Clin. Cancer Res. 2019, 25, 3266–3275. [Google Scholar] [CrossRef]
  22. Tu, W.; Sun, G.; Fan, L.; Wang, Y.; Xia, Y.; Guan, Y.; Li, Q.; Zhang, D.; Liu, S.; Li, Z. Radiomics signature: A potential and incremental predictor for EGFR mutation status in NSCLC patients, comparison with CT morphology. Lung Cancer 2019, 132, 28–35. [Google Scholar] [CrossRef]
  23. Jia, T.-Y.; Xiong, J.-F.; Li, X.-Y.; Yu, W.; Xu, Z.-Y.; Cai, X.-W.; Ma, J.-C.; Ren, Y.-C.; Larsson, R.; Zhang, J.; et al. Identifying EGFR mutations in lung adenocarcinoma by noninvasive imaging using radiomics features and random forest modeling. Eur. Radiol. 2019, 29, 4742–4750. [Google Scholar] [CrossRef] [PubMed]
  24. Nishino, M. Radiomics-based Cluster Groups to Predict Clinical-Pathologic and Genomic Characteristics of Stage I Lung Adenocarcinoma. Radiology 2022, 303, 673–674. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  25. Nair, V.S.; Gevaert, O.; Davidzon, G.; Napel, S.; Graves, E.E.; Hoang, C.D.; Shrager, J.B.; Quon, A.; Rubin, D.L.; Plevritis, S.K. Prognostic PET 18F-FDG uptake imaging features are associated with major oncogenomic alterations in patients with resected non-small cell lung cancer. Cancer Res. 2012, 72, 3725–3734. [Google Scholar] [CrossRef]
  26. Gandhi, Z.; Gurram, P.; Amgai, B.; Lekkala, S.P.; Lokhandwala, A.; Manne, S.; Mohammed, A.; Koshiya, H.; Dewaswala, N.; Desai, R.; et al. Artificial Intelligence and Lung Cancer: Impact on Improving Patient Outcomes. Cancers 2023, 15, 5236. [Google Scholar] [CrossRef]
  27. Berenguer, R.; del Rosario Pastor-Juan, M.; Canales-Vazquez, J.; Castro-García, M.; Villas, M.V.; Masilla Legorburo, F.; Sabater, S. Radiomics of CT Features May Be Nonreproducible and Redundant: Influence of CT Acquisition Parameters. Radiology 2018, 288, 407–415. [Google Scholar] [CrossRef]
  28. Caramella, C.; Allorant, A.; Orlhac, F.; Bidault, F.; Asselain, B.; Ammari, S.; Jaranowski, P.; Moussier, A.; Balleyguier, C.; Lassau, N.; et al. Can we trust the calculation of texture indices of CT images? A phantom study. Med. Phys. 2018, 45, 1529–1536. [Google Scholar] [CrossRef]
  29. Nguyen, B.; Fong, C.; Luthra, A.; Smith, S.A.; DiNatale, R.G.; Nandakumar, S.; Walch, H.; Chatila, W.K.; Madupuri, R.; Kundra, R.; et al. Genomic characterization of metastatic patterns from prospective clinical sequencing of 25,000 patients. Cell 2022, 185, 563–575.e11. [Google Scholar] [CrossRef]
  30. Clark, K.; Vendt, B.; Smith, K.; Freymann, J.; Kirby, J.; Koppel, P.; Moore, S.; Phillips, S.; Maffitt, D.; Pringle, M.; et al. The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. J. Digit. Imaging 2013, 26, 1045–1057. [Google Scholar] [CrossRef]
  31. National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC). The Clinical Proteomic Tumor Analysis Consortium Lung Squamous Cell Carcinoma Collection (CPTAC-LSCC) (Version 15) [Data Set]; The Cancer Imaging Archive: Little Rock, AR, USA, 2018. [Google Scholar] [CrossRef]
  32. Kirk, S.; Lee, Y.; Kumar, P.; Filippini, J.; Albertina, B.; Watson, M.; Rieger-Christ, K.; Lemmerman, J. The Cancer Genome Atlas Lung Squamous Cell Carcinoma Collection (TCGA-LUSC) (Version 4) [Data Set]; The Cancer Imaging Archive: Little Rock, AR, USA, 2016. [Google Scholar] [CrossRef]
  33. Tomczak, K.; Czerwińska, P.; Wiznerowicz, M. Review The Cancer Genome Atlas (TCGA): An immeasurable source of knowledge. Contemp. Oncol. 2015, 19, A68–A77. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  34. Silva, T.C.; Colaprico, A.; Olsen, C.; D’Angelo, F.; Bontempi, G.; Ceccarelli, M.; Noushmehr, H. TCGA Workflow: Analyze cancer genomics and epigenomics data using Bioconductor packages. F1000Research 2016, 5, 1542. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  35. Fedorov, A.; Longabaugh, W.J.; Pot, D.; Clunie, D.A.; Pieper, S.; Aerts, H.J.; Homeyer, A.; Lewis, R.; Akbarzadeh, A.; Bontempi, D.; et al. NCI imaging data commons. Cancer Res. 2021, 81, 4188. [Google Scholar]
  36. Isensee, F.; Jaeger, P.F.; Kohl, S.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 203–211. [Google Scholar] [PubMed]
  37. Zhao, Z.; Zhao, J.; Song, K.; Hussain, A.; Du, Q.; Dong, Y.; Liu, J.; Yang, X. Joint DBN and Fuzzy C-Means unsupervised deep clustering for lung cancer patient stratification. Eng. Appl. Artif. Intell. 2020, 91, 103571. [Google Scholar] [CrossRef]
  38. Li, S.; Han, H.; Sui, D.; Hao, A.; Qin, H. A Novel Radiogenomics Framework for Genomic and Image Feature Correlation using Deep Learning. In Proceedings of the 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Madrid, Spain, 3–6 December 2018. [Google Scholar] [CrossRef]
  39. Buda, M.; AlBadawy, E.A.; Saha, A.; Mazurowski, M.A. Deep Radiogenomics of Lower-Grade Gliomas: Convolutional Neural Networks Predict Tumor Genomic Subtypes Using MR Images. Radiol. Artif. Intell. 2020, 2, e180050. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  40. Hoivik, E.A.; Hodneland, E.; Dybvik, J.A.; Wagner-Larsen, K.S.; Fasmer, K.E.; Berg, H.F.; Halle, M.K.; Haldorsen, I.S.; Krakstad, C. A radiogenomics application for prognostic profiling of endometrial cancer. Commun. Biol. 2021, 4, 1363. [Google Scholar] [CrossRef]
  41. Xia, T.; Kumar, A.; Fulham, M.; Feng, D.; Wang, Y.; Kim, E.Y.; Jung, Y.; Kim, J. Fused feature signatures to probe tumour radiogenomics relationships. Sci. Rep. 2022, 12, 2173. [Google Scholar] [CrossRef]
  42. Malhotra, J.; Nguyen, D.; Tan, T.; Semeniuk Iii, G.B. Management of KRAS-mutated non-small cell lung cancer. Clin. Adv. Hematol. Oncol. HO 2024, 22, 67–75. [Google Scholar]
  43. Jänne, P.A.; Riely, G.J.; Gadgeel, S.M.; Heist, R.S.; Ou, S.I.; Pacheco, J.M.; Johnson, M.L.; Sabari, J.K.; Leventakos, K.; Yau, E.; et al. Adagrasib in Non-Small-Cell Lung Cancer Harboring a KRASG12C Mutation. N. Engl. J. Med. 2022, 387, 120–131. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Current radiogenomics workflow. The image details a process of genetic sample data collection, mutation identification, mutation clustering, and comparison to image classification model.
Figure 2. Workflow for the new algorithm proposed and evaluated in this study. The custom clustering/classification methodology relates the imaging features in lung cancer medical imaging data to lung cancer genetic mutation data by forming initial genetic clusters, running an image classification model on those clusters, and re-clustering the data if the model's accuracy is low. This was repeated until a preset accuracy and AUC were reached.
Figure 3. Overview of the final experimental algorithm. This simplified view summarizes the experiments that led to the algorithm, which clustered the initial genetic data, ran an image classification algorithm on these clusters to measure accuracy, and then re-clustered the data if the accuracy was low. This was repeated until an optimal accuracy and AUC were reached. In this way, the algorithm could identify which genetic clusters were best categorized by the imaging features that the CNN selected from the medical imaging data.
Figure 4. Total genetic groupings and patients present within each cluster.
Table 1. Overview of the two datasets used in this study, including cancer type, cancer location, total patients, and the imaging and other data available.

| Dataset | Cancer Type | Cancer Location | Total Patients | Imaging Available | Other Data Available |
| --- | --- | --- | --- | --- | --- |
| CPTAC-LUAD | Adenocarcinoma | Lung | 244 | CT, MR, PT, CR, pathology | Clinical, genomics, proteomics |
| TCGA-LUSC | Lung Squamous Cell Carcinoma | Lung | 37 | CT, NM, PT, pathology | Clinical, genomics |
| Total | -- | -- | 281 | -- | -- |
Table 2. Overview of top 12 genes with mutations present across a large majority of patients present in the data and with linkage to cancer severity or treatment. These top 12 genes were the genes with the largest numbers of subjects present within the data and relevant tumor mutations as identified by the analysis performed by the TCGA cohorts through sequencing of the primary tumor and genetic analysis.

| Outcome Linked Genetic Mutation | Total Subjects |
| --- | --- |
| TP53 | 187 |
| KRAS | 157 |
| KEAP1 | 148 |
| CDKN2 | 101 |
| EGFR | 93 |
| KDR | 89 |
| NTRK | 86 |
| STK1 | 71 |
| ROS1 | 69 |
| SMARCA | 65 |
| ALK | 60 |
| BRAF | 53 |
Table 3. Overview of differences present within the clinical data for the two datasets used in this study. Average age and the percentage of subjects in each demographic, gender, race, and stage category are reported to provide a comparison.

| | LUAD | LUSC |
| --- | --- | --- |
| Average Age | 65.33 | 67.25 |
| Race | | |
| Not reported | 3% | 0% |
| Hispanic or Latino | 1% | 2% |
| Not Hispanic or Latino | 75% | 65% |
| Not reported | 20% | 31% |
| Unknown | 1% | 1% |
| Gender | | |
| Female | 54% | 24% |
| Male | 44% | 76% |
| Demographic | | |
| Not reported | 3% | 0% |
| American Indian or Alaska Native | 1% | 0% |
| Asian | 1% | 1% |
| Black or African American | 10% | 9% |
| Not reported | 10% | 17% |
| Unknown | 1% | 0% |
| White | 75% | 72% |
| Stages | | |
| Not reported | 45% | 36% |
| Stage I | 1% | 1% |
| Stage IA | 12% | 11% |
| Stage IB | 13% | 17% |
| Stage II | 0% | 1% |
| Stage IIA | 6% | 9% |
| Stage IIB | 9% | 12% |
| Stage III | 0% | 1% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Provenzano, D.; Lichtenberger, J.P.; Goyal, S.; Rao, Y.J. Decoding Lung Cancer Radiogenomics: A Custom Clustering/Classification Methodology to Simultaneously Identify Important Imaging Features and Relevant Genes. Appl. Sci. 2025, 15, 4053. https://doi.org/10.3390/app15074053