Radiomics in Oncological PET Imaging: A Systematic Review—Part 1, Supradiaphragmatic Cancers

Radiomics is an emerging field in nuclear oncology, both promising and technically challenging. To summarize the work already undertaken on supradiaphragmatic neoplasia and assess its quality, we performed a literature search in the PubMed database up to 18 February 2022. Inclusion criteria were: studies based on human data; at least one specified tumor type; supradiaphragmatic malignancy; radiomics performed on PET imaging. Exclusion criteria were: studies based only on phantom or animal data; technical articles without a clinically oriented question; fewer than 30 patients in the training cohort. A review database containing PMID, year of publication, cancer type, and quality criteria (number of patients, retrospective or prospective nature, independent validation cohort) was constructed. A total of 220 studies met the inclusion criteria. Among them, 119 (54.1%) studies included more than 100 patients, 21 studies (9.5%) were based on prospectively acquired data, and 91 (41.4%) used an independent validation set. Most studies focused on prognostic and treatment response objectives. Because the textural parameters and methods employed differ widely from one article to another, it is difficult to aggregate and compare articles. New contributions and radiomics guidelines tend to improve the quality of the reported studies over the years.


Introduction
The pursuit of personalized medicine, particularly in oncology, has created the need to consider an ever-increasing amount of data in order to propose the most appropriate treatment at the most appropriate time for each patient and thereby improve survival outcomes. Some recently emerged scientific fields rely on high-throughput measurement of biological molecules and attempt to comprehensively understand the underlying biology of systems of interest at the highest resolution possible [1]. The rationale of these so-called "omics" disciplines is to generate a reliable prognostic or predictive model for a given condition, which should be more accurate than existing models built on conventionally collected clinical data [1]. The first "omic" disciplines were born in the late 1980s and were represented by genomics, proteomics, and metabolomics. Their importance was acknowledged so quickly that an "Omics era" has been recognized since the end of the twentieth century. Taking advantage of the digital transformation in healthcare, medical imaging contributed its own "omic" field under the name of "radiomics". Radiomics corresponds to the extraction of a large number of quantitative features from medical images, beyond the dimensional, uptake, or volume parameters traditionally used in radiology and nuclear medicine. Their extraction relies on a rigorous processing chain in which each step can greatly influence the result [2] and hinder the reproducibility of the findings. Interestingly, a recent study evaluated 77 oncology-related radiomics studies using an objective measurement of radiomics research quality and concluded that the overall scientific quality and reporting of radiomics was insufficient [3]. It highlighted the frequent absence of a validation cohort and underlined that the most frequent limitations to reproducibility were missing data and insufficient reporting of study objectives, blind assessment, and sample size. Moreover, a major criticism concerned how rarely clinical utility was demonstrated in those onco-radiomics articles.
The large amount of data generated by radiomics can be difficult to handle with traditional statistical approaches. Artificial intelligence (AI), with its ability to identify patterns within a massive dataset, is highly useful in this setting. This term covers several interrelated categories, including machine learning, which refers to all modeling and prediction applications based on training data (e.g., logistic regression), and deep learning, a subcategory of machine learning based on a neural network that is supposed to reproduce, on a smaller scale, the functioning of a human brain [4]. One of the key points of AI is the training base, which must be large enough to avoid overfitting [5], in other words, to prevent the model from becoming too attuned to the training data and losing its applicability to any other dataset.
Given the increasing number of articles on radiomics in oncological Positron Emission Tomography (PET) imaging, we here provide a systematic review of the literature, with a particular focus on assessing the quality of articles. In this first part, we consider only supradiaphragmatic malignancies; other cancers will be treated in part 2.

Materials and Methods
This systematic review of published literature was performed according to the reporting standards of the PRISMA-P statement [6]. It was not registered.

Search Strategy, Inclusion and Exclusion Criteria
We performed a literature search in the PubMed database to identify all eligible articles using the following query: ("PET" OR "positron") AND ("radiomics" OR "radiomic" OR "texture" OR "textural"). Results were admitted from 1 January 1990 up to and including 18 February 2022. Reviews were automatically identified using the article type options and removed from the extracted database.
Inclusion criteria were (Table 1): (1) studies based on human data, (2) studies specifying at least one supradiaphragmatic tumor type, and (3) studies performing radiomics on PET imaging. Exclusion criteria were: (1) studies not related to medical topics, (2) reviews, posters, editorials, comments, case reports, (3) duplicates, (4) studies outside the oncological field or radiomics not performed on PET, (5) studies based only on phantom or animal data, (6) technical articles (optimization, robustness) without a clinically oriented question, (7) studies including fewer than 30 patients in the training cohort (for studies including multiple types of cancers, each cancer type was considered separately), (8) malignancy not strictly supradiaphragmatic (e.g., esophagus), (9) studies not in English, and (10) full text not available.
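For readers who wish to rerun the search programmatically, the query and date window above can be expressed as a request to the NCBI E-utilities `esearch` endpoint. This is only a sketch: the original search may well have been run through the PubMed web interface, the choice of publication date (`datetype=pdat`) is an assumption, and the helper name `build_esearch_url` is hypothetical. The code only builds the URL and performs no network call.

```python
from urllib.parse import urlencode

# Search query as stated in the text.
QUERY = ('("PET" OR "positron") AND '
         '("radiomics" OR "radiomic" OR "texture" OR "textural")')

ESEARCH_ENDPOINT = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query=QUERY, mindate="1990/01/01", maxdate="2022/02/18"):
    """Build an esearch URL for PubMed restricted to a date window.

    Assumption: the date window is applied to the publication date (pdat).
    """
    params = {
        "db": "pubmed",
        "term": query,
        "datetype": "pdat",
        "mindate": mindate,
        "maxdate": maxdate,
        "retmode": "json",
    }
    return ESEARCH_ENDPOINT + "?" + urlencode(params)


url = build_esearch_url()
print(url)
```

The number of hits returned today may differ from the count reported in this review, since PubMed indexing changes over time.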

Quality Assessment
Studies were assessed for quality based on three items:
1. The number of patients included;
2. The retrospective (score 0) or prospective (score 2) nature of the collection of data;
3. The use of a completely independent cohort for validation: none or k-fold cross-validation, as it can expose the model to data leakage (score 0), partition of the cohort into completely separate training and test sets (score 1), external validation cohort (score 2).
A simple quality score (QS), defined as the sum of the three items stated above, was calculated. The maximum possible score of 6 indicated a high-quality study design. The mean and 95% confidence interval (CI) of the quality scores were calculated for all database articles, grouped by year of publication.
Results from articles with a QS strictly lower than 3 were not considered in the Results section.
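The scoring rule described in this section can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function and variable names are hypothetical, the patient-number item is passed in directly as a 0 to 2 score because its thresholds are not given here, and the 95% CI uses a normal approximation, which is an assumption about the method.

```python
from math import sqrt
from statistics import mean, stdev

# Validation item, scored as described in the text.
VALIDATION_SCORES = {"none": 0, "k-fold": 0, "split": 1, "external": 2}


def quality_score(patient_item_score, prospective, validation):
    """Return the QS (0 to 6) as the sum of the three items."""
    if patient_item_score not in (0, 1, 2):
        raise ValueError("patient_item_score must be 0, 1 or 2")
    design_score = 2 if prospective else 0  # retrospective = 0, prospective = 2
    return patient_item_score + design_score + VALIDATION_SCORES[validation]


def mean_ci95(scores):
    """Mean and normal-approximation 95% CI (assumed method)."""
    m = mean(scores)
    half_width = 1.96 * stdev(scores) / sqrt(len(scores))
    return m, (m - half_width, m + half_width)


def is_reported(qs):
    """Articles with a QS strictly lower than 3 are not discussed in the results."""
    return qs >= 3


# Toy example: a prospective study with an external validation cohort.
qs = quality_score(patient_item_score=2, prospective=True, validation="external")
print(qs, is_reported(qs))
```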

Textural Parameters Used
The number of textural parameters that can be extracted from an image is huge. They can be grouped into several categories, which were assessed in this review [7,8]:
• Shape features: a purely geometric description of the segmented volume (metabolic tumoral volume, sphericity).
• First order features (also called histogram-based features): these parameters are based on the value of each voxel included in the segmented region without taking into account their spatial inter-relationships (maximum, minimum, average, standard deviation, etc.).
• Second order features: these parameters take into account the spatial interrelations between groups of pixels and are computed from texture matrices, calculated from the segmented region of interest [9]. Examples include the gray-level co-occurrence matrix (GLCM), which represents the frequency of occurrence of two intensity levels in neighboring voxels within a specific distance along a fixed direction, and the gray-level run-length matrix (GLRLM), which encodes the size of homogeneous runs for each image intensity.
• Higher order features: higher-order statistics features are computed after the application of specific mathematical transformations or filters [8].
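To make the difference between first order and second order features concrete, the following toy sketch computes a few histogram-based values and a symmetric GLCM on a tiny 2D image that is assumed to be already segmented and discretized into integer gray levels. The function names are hypothetical and the code is deliberately simplified (2D, one direction, distance 1); it is not a substitute for a standardized radiomics implementation.

```python
from statistics import mean, pstdev


def first_order_features(voxels):
    """Histogram-based features: spatial arrangement is ignored."""
    return {
        "min": min(voxels),
        "max": max(voxels),
        "mean": mean(voxels),
        "std": pstdev(voxels),
    }


def glcm(image, levels, offset=(0, 1)):
    """Symmetric gray-level co-occurrence matrix for one offset (row, col)."""
    d_row, d_col = offset
    rows, cols = len(image), len(image[0])
    matrix = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                matrix[i][j] += 1
                matrix[j][i] += 1  # count the pair in both directions
    return matrix


def glcm_contrast(matrix):
    """Contrast: sum of p(i, j) * (i - j)^2 over the normalized GLCM."""
    total = sum(sum(row) for row in matrix)
    n = len(matrix)
    return sum(matrix[i][j] * (i - j) ** 2
               for i in range(n) for j in range(n)) / total


toy_image = [[0, 0, 1],
             [1, 2, 2]]
voxels = [v for row in toy_image for v in row]
print(first_order_features(voxels))
cooccurrence = glcm(toy_image, levels=3)
print(cooccurrence)                 # [[2, 1, 0], [1, 0, 1], [0, 1, 2]]
print(glcm_contrast(cooccurrence))  # 0.5
```

Shuffling the voxels of the toy image would leave the first order features unchanged but alter the GLCM, which is exactly the spatial information that second order features add.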

Data Collection and Review
An Excel review database was generated. The database was filled in completely three times (two readings one week apart by one author and a further reading by a second author). Any discrepancies were corrected by consensus. The following parameters were extracted from each article:
• PMID, first author, year of publication;
• Organ/type of cancer;
• Quality data: number of patients, retrospective or prospective nature, validation, quality score;
• Objective of the study;
• Maximal order of textural parameter (shape only, first order, second order, higher order).

Discrepancies between the Two Reading Sessions
Eleven discrepancies between the two reading sessions of the database were encountered and led to a third reading: one duplicate was identified, two articles were misclassified regarding cancer subtype, and eight discrepancies concerned patient number and presence of a validation cohort.

Searching Results
A total of 1180 studies were identified in the PubMed database, 239 of which were reviews and therefore automatically excluded. Of the remaining 941 studies, 537 more were excluded: 111 were off topic, 57 corresponded to undetected reviews or editorials, 7 were duplicates, 176 were not oncological or not based on PET radiomics, 27 were not human-based, 89 were technical articles, and 70 included fewer than 30 patients in the training cohort. A total of 404 articles were then sought for retrieval: 5 were not written in English, 17 had no full text available, and 162 dealt with non-supradiaphragmatic malignancies and were therefore excluded. Finally, 220 studies were included in this review (Figure 1). A study characteristics table is available in a separate file (Supplementary Table S1).
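The selection flow can be checked with simple arithmetic. The short sketch below recomputes each stage from the counts given in this paragraph; the dictionary keys are shorthand labels chosen for illustration.

```python
identified = 1180
automatically_removed_reviews = 239
screened = identified - automatically_removed_reviews

screening_exclusions = {
    "off topic": 111,
    "undetected reviews or editorials": 57,
    "duplicates": 7,
    "not oncological or not PET radiomics": 176,
    "not human-based": 27,
    "technical articles": 89,
    "fewer than 30 patients in training cohort": 70,
}
sought_for_retrieval = screened - sum(screening_exclusions.values())

retrieval_exclusions = {
    "not in English": 5,
    "no full text": 17,
    "non-supradiaphragmatic": 162,
}
included = sought_for_retrieval - sum(retrieval_exclusions.values())

print(screened, sought_for_retrieval, included)  # 941 404 220
```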

Quality Assessments
The mean quality score of the articles was 2.05/6, with a steady improvement over the years (from 0.80 in 2014, to 1.96 in 2018, to 2.33 in 2021), as displayed in Table 2. A total of 119 (54.1%) studies included more than 100 patients each, 21 studies (9.5%) were based on prospectively acquired data, and 91 (41.4%) articles described an independent validation set. The number of publications was found to increase each year (Table 2).

Textural Parameters Used
The vast majority of the included studies (n = 189, 85.9%) used both first- and second-order textural parameters. Eight studies (3.6%) used only first-order parameters. Finally, 23 studies (10.5%) used higher order textural parameters.

In particular, three studies describing more than 100 patients used radiomics to predict O6-methylguanine-DNA methyltransferase promoter methylation status [12], isocitrate dehydrogenase phenotype [15], and mutations in the telomerase reverse transcriptase promoter status [28], with encouraging results. As for differentiation between progression and radionecrosis, studies showed heterogeneous results both in terms of tracer and performance. Wang et al. [22] reported promising results with 11C-methionine (160 patients, 112 of whom were in the training cohort; AUC of 0.914), whereas Ahrari et al. [31] noted a poor added value of PET radiomics using 18F-FDOPA PET in patients with high-grade glioma.

Head and Neck Cancer including Salivary Gland Cancers
A total of 48 articles met the inclusion criteria in the group of head and neck malignancies, 46/48 (95.8%) using 18F-FDG. Only 1 study out of 48 (2.1%) used 18F-FMISO [36] and 1/48 (2.1%) used 18F-FLT [42]. However, the conclusions of these two articles remain limited due to the small number of patients studied (between 30 and 35). Included articles had an average of 155.5 patients (range 30-707), with 24/48 (50.0%) studies including more than 100 patients and 24/48 (50.0%) using an independent validation cohort.
Most of these articles (37/48, 77.1%) focused on prognostic issues, offering various and heterogeneous models. In a large retrospective study on patients with nasopharyngeal carcinoma (470 patients in the training set and 237 in the test set), Peng et al. [61] used a deep-learning approach to predict disease-free survival with a C-index of 0.754 (95% CI: 0.709-0.800) and potentially guide induction chemotherapy. Notably, the only negative study was conducted by Ger et al. on a large monocentric population of 686 head and neck cancer patients, with the conclusion that radiomic features were not consistently associated with survival, in either CT or PET images, even within patients undergoing the same imaging protocol [68].
Other articles focused on indirect prognostic factors such as human papillomavirus (HPV) positivity, as HPV-positive cancers have longer overall survival than HPV-negative ones [50]. In particular, Haider et al. tried to correlate radiomics findings with HPV positivity in oropharyngeal squamous cell carcinoma: using 435 primary tumors (326 for training and the other 109 for validation), their model achieved an AUC of 0.83 [50].
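Since several of the prognostic models above are summarized by a concordance index (C-index), a brief illustration of how this metric is commonly computed may help interpret the reported values. The sketch below implements Harrell's C-index on toy data; it is a generic textbook formulation with hypothetical names, not the code used in any of the cited studies.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index.

    times  : observed follow-up times
    events : 1 if the event occurred, 0 if censored
    risks  : predicted risk scores (higher = earlier event expected)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had the event
            # strictly before patient j's observed time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: the model ranks all comparable pairs correctly.
print(concordance_index(times=[1, 2, 3], events=[1, 1, 0], risks=[3, 2, 1]))  # 1.0
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of comparable patient pairs.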
One paper studied radiomics as a potential tool to assess node status: Chen et al. [39] used a model associating deep learning and radiomics to classify lymph nodes as normal, suspicious or involved, with a reported accuracy of 0.88.
For salivary gland cancers, radiomics was also studied for local control and overall survival, with good preliminary results [80][81][82]. However, the QS were low, ranging from 0 to 2.

Lung Cancer
Radiomics was used for staging purposes by Zheng et al. [154], who used a radiomics-based model to predict mediastinal lymph node metastases in a population of 716 patients (501 included in the training cohort and the other 215 in the testing set): in the testing cohort, the performance of the radiomics approach was significantly higher than that of clinical node staging (p = 0.037).
On the prognostic side, most studies focused on local control and treatment response prediction. Among them, a study on prospectively acquired data conducted by Mattonen et al. [88] (training: n = 145 patients; validation: n = 146 patients) was used to build a model that predicted recurrence/progression in non-small cell lung cancer (concordance of 0.74 in the testing set). A few studies have focused on the prediction of treatment side effects, especially radiation pneumonitis [93,139]. In the study conducted by Cui et al. [139], an externally validated deep learning model outperformed traditional normal tissue complication probability models in a multi-omics actuarial neural network architecture for prediction of radiation pneumonitis of grade 2 or higher.

Breast Cancer
We screened 32 publications on breast cancer, all employing 18F-FDG. The average number of enrolled patients was 126.8 (range 35-435), with 18/32 (56.3%) studies including more than 100 patients; 3 studies (9.4%) were based on prospectively acquired data and 6/32 (18.8%) used an internal independent validation cohort. Some of these studies investigated the correlation between texture parameters and immunohistochemical subtypes of breast cancer, in particular Liu et al. [210]. However, in a prospective study including 171 patients, Groheux et al. [202] did not find a high discriminative power for PET-derived texture metrics.
Other studies [190,214] have investigated the added value of texture parameters for predicting response to neoadjuvant chemotherapy and suggest a trend toward improved prediction models. In the largest study included for breast cancer, Lee et al. [214] concluded that the predictive power of a model incorporating both clinicopathological and texture factors was significantly higher than that of a model with clinicopathological factors only (AUC 0.80 vs. 0.73, p = 0.007).
Two retrospective studies were available for thymic epithelial tumors and 18F-FDG PET/CT [228,229], both including fewer than 50 patients and without validation cohorts.

Quality Assessment and Textural Parameters Used
In this work, we considered 220 publications related to radiomics in supradiaphragmatic malignancies. Our composite score for the evaluation of the quality of the publications was low, estimated at 2.05/6, in good agreement with a previous work reporting low quality of radiomics publications [3]. Almost half of the publications had fewer than 100 patients, a number often cited as a threshold to avoid overfitting [230]. About 60% of the studies did not use an independent data set for model validation. This phenomenon, although explained by the difficulty of collecting data, limits the generalizability of the conclusions, as radiomics is dependent on acquisition protocols [2]. Although there are reservations about the quality of the work in previous publications, we observe an improvement over the years in our composite criterion combining the number of patients, the presence of a validation cohort, and the presence of prospective data. This improvement is probably due to the publication of guidelines and dedicated checklists to ensure proper methodology (e.g., the Image Biomarker Standardization Initiative [231] and the Radiomics Quality Score Checklist [232]).
In this review, most papers consider textural parameters up to second order. The consideration of higher order parameters is less frequently encountered (about 10%).

Trends and Topics
The number of studies on radiomics is increasing exponentially, relying on both machine learning and deep learning approaches. The most studied supradiaphragmatic cancers are, in order of frequency, lung cancer, head and neck cancers, and breast cancers. The majority of the studies described here focus on prognostic and treatment response objectives. 18F-FDG remains the most studied tracer, as expected given its wide clinical use. Regarding brain tumors, PET radiomics and AI analysis could provide more specific information on diagnosis and prognosis. Brain cancer patients usually have a poor prognosis, and new therapeutic strategies are needed to improve their outcomes [27]. PET radiomics and AI could help in the non-invasive diagnosis of brain diseases and in the identification of more aggressive histology at baseline, supporting personalized and precision medicine in these conditions. Amino acid tracers [22,27] (e.g., 18F-DOPA, 11C-methionine) and new targeted tracers could also increase the specificity of PET radiomics in certain diseases.
Regarding lung tumors, PET radiomics and AI analysis have been widely evaluated in several settings. The number and the quality of available studies in the literature could help introduce these advanced systems into clinical practice. New studies should probably focus on external validation cohorts to clearly assess the clinical usefulness of PET radiomics and AI in lung cancer, such as in distinguishing lung metastases from primary tumors with different histology and prognosis.
Similarly, for head and neck tumors, PET radiomics and AI analysis should be extended to routine clinical evaluation to enable personalized and precision medicine for these patients. Some limitations have emerged in 18F-FDG PET/CT applications in head and neck tumors, as it cannot distinguish between inflammatory activation in some tissues and the localization of the tumor [47]. PET radiomics and AI analysis could help physicians overcome these limitations.
In breast cancer, PET/CT is of great value for staging and restaging purposes [191]. The limited avidity of 18F-FDG in some primary breast tumors could be a limitation in the application of PET radiomics and AI analysis. Nevertheless, new tracers such as 18F-FAPI and monoclonal antibodies could be used in the near future to study the textural heterogeneity of primary breast tumors with PET radiomics and AI analysis.
18F-FDG PET/CT still has limited indications in thyroid cancer [226], mainly for restaging of differentiated carcinomas and staging of anaplastic carcinomas. Therefore, PET radiomics and AI analysis should be further evaluated.
Given the increasing number of immunotherapies in metastatic cancer from different primary tumors, PET radiomics and AI analysis may be considered to better evaluate cases of stable disease or pseudo-progression due to inflammatory reaction to immunomodulators, such as in lung cancer or breast cancer.

Limitations
Our review has several limitations. First, we set an arbitrary threshold of 30 patients to eliminate studies that were too exposed to an overfitting bias. One of the disadvantages of this selection is the potential elimination of rare pathologies from this review, as previously reported [233].
Second, the scale used to assess the quality of the articles was practical but rather simplistic. This score made it possible to evaluate a large number of articles with fairly high reproducibility (11 discrepancies between the reading sessions), at the expense of a thorough analysis of the methods.
Finally, because the textural parameters and methods used differ widely from one article to another, even for similar subjects or cancers, it was challenging to aggregate and compare the articles.

Conclusions
Radiomics and AI are emerging fields in nuclear oncology, especially in brain, head and neck, thyroid, and breast cancers. Although the field is technically demanding, new contributions with robust validation cohorts and guidelines for clinical practice applications should continue to improve the quality and the impact of the reported studies over the years.