Meta-Analysis and Systematic Review of the Application of Machine Learning Classifiers in Biomedical Applications of Infrared Thermography

Atypical body temperature values can be an indication of abnormal physiological processes associated with several health conditions. Infrared thermal (IRT) imaging is an innocuous imaging modality capable of capturing the natural thermal radiation emitted by the skin surface, which is connected to physiology-related pathological states. The implementation of artificial intelligence (AI) methods for interpretation of thermal data can be an interesting solution to supply a second opinion to physicians in a diagnostic/therapeutic assessment scenario. The aim of this work was to perform a systematic review and meta-analysis concerning different biomedical thermal applications in conjunction with machine learning strategies. The bibliographic search yielded 68 records for a qualitative synthesis and 34 for quantitative analysis. The results show potential for the implementation of IRT imaging with AI, but more work is needed to retrieve significant features and improve classification metrics.


Introduction
An atypical body temperature value can be an indication of an abnormal physiological process or health issue. Infrared thermal (IRT) imaging, also known as thermography, is an imaging modality capable of capturing the natural thermal radiation emitted by an object through the use of an IR camera [1]. In the case of the human body, this heat emission is mainly dependent on underlying skin structures pertaining to the vascular and nervous systems [2]. This technique can be implemented in a static manner, by performing image capture without stressing the skin area under assessment, or dynamically, through the prior application of a chemical, thermal or mechanical stimulus [3]. In either case, the internationally accepted guidelines should be followed to ensure proper measurements and reproducibility of temperature readings [4,5]. Its innocuousness, combined with the ability to perform a fast analysis at a low price, has made it an interesting method to assist physicians in the diagnosis and/or treatment monitoring of several pathologies, such as diabetic foot [6], rheumatoid arthritis [7], breast tumors [8] and skin cancer [9].
The acquired temperature matrices, i.e., thermograms, require image processing strategies to retrieve temperature data that are meaningful for the condition under analysis. Still, the understanding of this information is a challenging and time-consuming task that can be eased through the implementation of artificial intelligence (AI) computational methods based on machine learning (ML) algorithms [10]. In recent years, thermal information has been applied by researchers as input features for AI classifiers to give

Information Sources
The reference sources Scopus, PubMed, ISI Web of Science and IEEE Xplore were used for the bibliographic search. The following syntax was used for each source, respectively:
Scopus: (TITLE-ABS-KEY (machine learning OR (machine classification) OR (artificial intelligence))) AND TITLE-ABS-KEY (thermography OR (infrared imaging) OR (thermal imaging))) AND TITLE-ABS-KEY (biomedical))
PubMed: ((machine learning[Title/Abstract] OR (machine classification[Title/Abstract]) OR (artificial intelligence[Title/Abstract])) AND (thermography[Title/Abstract] OR (infrared imaging[Title/Abstract]) OR (thermal imaging[Title/Abstract]) AND (biomedical[Title/Abstract])
ISI Web of Science: TOPIC: (machine learning OR (machine classification) OR (artificial intelligence)) AND TOPIC: (thermography OR (infrared imaging) OR (thermal imaging)) AND TOPIC: (biomedical)
IEEE Xplore: ((((((("All Metadata":thermography) OR "All Metadata":infrared imaging) OR "All Metadata":thermal imaging) AND "All Metadata":machine learning) OR "All Metadata":machine classification) OR "All Metadata":artificial intelligence) AND "All Metadata":biomedical)
The database Google Scholar was also searched with the same keyword combination. To guarantee that the highest possible number of articles was found, the operator OR was employed to include articles that applied alternative terms and/or expressions to "thermography" and "machine learning". No time restriction was imposed, and publications were included from the first available date to March 2020. Duplicates were removed after the bibliographic search.

Eligibility Criteria and Screening
After article selection, the titles and abstracts were analyzed to exclude publications that did not refer to the use of biomedical thermal data for ML applications. The first eligibility criterion excluded publications that applied infrared imaging in the near and/or medium IR section of the spectrum. Articles where artificial intelligence strategies were not used for the assessment of a given pathology and/or treatment were also removed. The third eligibility criterion kept only papers written in English. Reviews and opinion articles were eliminated, as well as publications whose full text was unavailable online and remained unavailable after contacting the authors; these constituted the fourth and fifth criteria, respectively. Lastly, papers that did not present any quantitative classification metrics were also disregarded.

Results
The bibliographic search yielded 429 records after the removal of duplicates. The title and abstract screening excluded 305 publications that did not mention the application of artificial intelligence algorithms to biomedical thermal data. From the remaining 124 articles, 27 were excluded due to the application of near and/or medium infrared spectroscopy and 13 due to the implementation of learners on tasks unrelated to pathology appraisal. The third criterion discarded four articles, and nine reviews were removed based on criterion number four. Finally, three papers were not considered for this systematic review and meta-analysis, since no performance metrics were reported. The remaining 68 articles underwent a qualitative synthesis, being divided by assessed disease: breast cancer (39), skin neoplasm (3), diabetes disease (6) and other conditions (19). The latter included studies concerning pathologies mentioned in fewer than three publications. The quantitative synthesis was performed for 34 of the 68 encountered publications and individually for breast cancer (22 studies) and diabetic foot (three studies), since not all records reported the necessary metrics for the meta-analysis process and/or focused on the diagnosis of a given pathology. A flow diagram describing the phases of this systematic review and meta-analysis is presented in Figure 1 [13]. The information retrieved from the encountered studies is summarized in Appendix A.
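The screening counts reported above can be checked with a few lines of arithmetic. The following is only an illustrative sketch in Python; the variable names and the short labels for each exclusion step are assumptions made for readability, while the numbers are those given in the text.

```python
# Counts reported in the text; names and labels are illustrative only.
records_after_duplicates = 429
excluded_title_abstract = 305

# Full-text exclusions, in the order in which the text lists them.
full_text_exclusions = {
    "near/medium infrared spectroscopy": 27,
    "learner not used for pathology appraisal": 13,
    "third criterion": 4,
    "reviews (criterion four)": 9,
    "no performance metrics reported": 3,
}

# Articles left after title/abstract screening.
assessed_full_text = records_after_duplicates - excluded_title_abstract

# Articles left for the qualitative synthesis.
qualitative_synthesis = assessed_full_text - sum(full_text_exclusions.values())

print(assessed_full_text)     # 124
print(qualitative_synthesis)  # 68
```

Both printed values agree with the 124 and 68 articles stated in the text.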

Breast Cancer
Breast cancer diagnosis is the topic with the highest number of publications for the application of thermal imaging in conjunction with ML algorithms. Ng et al. presented the first studies using ANN to detect breast malignancies, but with low performance metrics of ACC, SN and SP (58.5%, 54% and 67%) [15,16]. These results were later improved with a more elaborate approach focused on the use of biostatistical methods (SN = 81.2%, SP = 88.2%) [17]. The main author also collaborated in the work of Tan et al., where fuzzy rules were implemented for the construction of hidden layers in a neural network, yielding good SN results (100%) in the proposed classification tasks [18]. The improvement of classification parameters when fuzzy-neural networks are selected in place of single neural networks was also confirmed by another author's research [19,20]. Schaefer et al. studied the ideal number of partitions in a fuzzy-rule based classification system for breast cancer, reaching ACC, SN and SP values of 97.95%, 93.10% and 99.15%, respectively, as the number of partitions increased [21,22]. A fuzzy model based on C-means clustering by Lashkari et al. showed accuracy values of 75% for the screening of suspicious breast areas, a lower performance when compared to the supervised AdaBoost algorithm developed in their previous work (88%) [23,24]. Other strategies to improve neural networks' performance in breast cancer classification include the extraction of features with wavelet transform [25] and numerical simulations conducted on various breast tissue composition models [26].
The preferred learner for breast cancer detection studies is based on Support Vector Machines (SVM). Francis et al. studied the extraction of statistical and textural features in the curvelet domain, classifying 91% of instances correctly [27]. The use of Principal Component Analysis (PCA) in subsequent work to reduce the number of features, along with dynamic thermal image acquisition, led to lower values of ACC (83.3%) [28]. A similar approach, using only four clinically significant features, had previously achieved good performance metrics with a larger dataset (ACC = 88.10%, SN = 85.71%, SP = 90.48%) [29]. The Radial Basis Function (RBF) kernel, together with textural features retrieved from thermograms, appears to be the best option for the implementation of SVM in breast cancer research, with high performance metrics reported in [30][31][32][33]. Gogoi et al. focused their work on the gathering of the best possible inputs that characterize healthy, benign and malignant cases through temperature and intensity analysis (ACC = 83.2%, SN = 85.5%, SP = 73.2%) [34]. SVM was the chosen learner after comparison with other AI algorithms [35]. Satish et al. tried a different strategy, using local energy features of wavelet sub-bands after temperature matrix normalization to classify normal and abnormal breast thermograms with ACC = 91%, SN = 87.23% and SP = 94.34% with an SVM with a Gaussian kernel [8].
Apart from neural networks and support vector machines, studies with equally high classification metrics can be found using AdaBoost [36], Bayesian [37] and k-NN [38,39] classifiers. With the goal of achieving the best possible outcome, some authors chose to test different AI learners, with SVM [40,41], k-NN [42,43], ANN [44,45] and decision tree [46] algorithms being the best options among the tested ones.
The use of ensemble classifiers has been frequent in recent years. Most of the work of Krawczyk et al. is based on a pool of learners, which are as distinct as possible from each other, in order to reduce complexity and avoid redundant information during the classification stage [47][48][49]. A first approach used different feature inputs to train the different classifiers and address class imbalance, evolving later into a one-class classification strategy to increase learners' sensitivity to minority classes [47,48]. Another way to overcome data shortage is suggested by the author through feature space clustering, assigning each cluster to the most competent classifier from the ensemble [50]. This last method gave the best results with ACC = 90.02%, SN = 82.55% and SP = 91.89%.
Few implementations of ML unrelated to breast cancer diagnosis were found. Zadeh et al. used fuzzy active contours to segment suspected breast tumor areas (ACC = 91.89%) and Saednia et al. developed a supervised ML algorithm based on Random Forest to assess dermatitis caused by radiation therapy, reporting thermal markers indicative of radiation-induced skin toxicity with an ACC of 87% [51].

Diabetes Disease
The combination of IR imaging with AI algorithms has proven its usefulness in diabetic foot detection. Hernandez-Contreras et al. retrieved features representative of 3D morphological patterns and position of the foot, which achieved ACC values of 94.33% with an ANN classifier [52]. High classification rates (91%) were attained more recently with a learner, also based on neural networks, fed with inputs collected after the analysis of the surface temperature distribution of the foot [6]. Adam et al. decomposed images of left, right and bilateral foot to calculate entropy and texture features. The best performance was achieved by a k-NN learner (ACC = 93.16%, SN = 90.31%, SP = 98.04%) [53]. Good classification metrics for k-nearest neighbor were also presented by Vardasca et al. The authors first used steady-state thermal images and built a computational tool for automatic image processing and classification with a success rate of 92.5% [54]. Later on, the use of dynamically acquired thermograms was tested by the same research group, achieving ACC = 81.25%, SP = 80% and SN = 100% with k-NN [55]. Good results were also recently achieved by Cruz-Vega et al. [56] for the detection of changes in the plantar temperature of the foot. The authors were the first to apply a deep learning structure for diabetic foot assessment, obtaining ACC and SN values of 85.3% and 91.67%, respectively. Thirunavukkarasu et al. presented the only study pertaining to diabetes disease that did not report on the evaluation of diabetic foot [57]. Instead, thermal data and artificial intelligence were used as a prescreening tool for type II diabetes, through the analysis of tongue thermograms. The ANN classifier was the top learner with an ACC of 94.28%.

Skin Cancer
The encountered studies that focused on the use of ML with thermal data for skin cancer detection were conducted by the same group of authors. Magalhaes et al. first tried to distinguish benign from malignant lesions with static IR imaging, reaching a low ACC value of 60% with k-NN classifiers [58]. The overall results were slightly improved when dynamic thermal information was added to the feature input set [59]. Recently, the distinction of melanomas and nevi lesions was successfully performed, reaching ACC and SN values of 84.2% and 91.3% [9].

Quantitative Synthesis
The meta-analysis process required the number of True Positives (TP), False Negatives (FN), True Negatives (TN) and False Positives (FP) for each study. Thus, studies that did not report this information, nor supplied data sufficient to retrieve it, were not included in the quantitative synthesis.
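To illustrate why these four counts are required, the sketch below derives SN, SP and the (log) DOR from a single 2×2 table. It is written in Python rather than with the R packages used in this work; the function name `diagnostic_metrics` and the example counts are hypothetical, and the 0.5 continuity correction applied when a cell is zero is an assumed convention that the text does not specify.

```python
import math


def diagnostic_metrics(tp, fn, tn, fp, correction=0.5):
    """Return SN, SP, DOR and log(DOR) for one study's 2x2 table.

    `correction` is added to every cell only when at least one cell is
    zero (assumed convention, not stated in the reviewed text).
    """
    if min(tp, fn, tn, fp) == 0:
        tp, fn, tn, fp = (x + correction for x in (tp, fn, tn, fp))
    sn = tp / (tp + fn)          # sensitivity: TP / (TP + FN)
    sp = tn / (tn + fp)          # specificity: TN / (TN + FP)
    dor = (tp * tn) / (fp * fn)  # diagnostic odds ratio
    return {"SN": sn, "SP": sp, "DOR": dor, "logDOR": math.log(dor)}


# Hypothetical counts, not taken from any of the reviewed studies.
m = diagnostic_metrics(tp=45, fn=5, tn=40, fp=10)
print(m["SN"], m["SP"], m["DOR"])
```

A study that reports only ACC, or SN without the sample composition, does not allow the table to be rebuilt, which is the reason such records could not enter the quantitative synthesis.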
The software R was used with the "meta" package to perform univariate analysis and retrieve sensitivity (SN), specificity (SP) and log diagnostic odds ratio (DOR) forest plots, and with the "mada" package to plot the summary receiver operating characteristic (SROC) curve of all studies [78]. The plots were constructed with the information retrieved from all studies included in the quantitative analysis process, independently of the studied pathology. Thus, a better visual assessment of the distribution of the different metrics among studies is possible.
The test for heterogeneity retrieved a high chi-square (χ²) value for the analysis of all records, suggesting high heterogeneity between studies (Appendix B). The high values of I² encountered also support this idea [79]. Thus, a random effects model was used to summarize the effects, and separate analyses were made for studies focused on breast cancer and diabetic foot. Still, the values of χ² and I² remained high for breast cancer studies, while being low for diabetic foot ones (Appendix B).
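For reference, the heterogeneity statistics mentioned above can be sketched as follows. This is a minimal Python illustration, assuming inverse-variance weights and the DerSimonian-Laird estimator of the between-study variance; it is not the code of the "meta" package, and the function name and the example inputs are hypothetical. For the log DOR, a commonly used per-study variance is 1/TP + 1/FN + 1/TN + 1/FP.

```python
def random_effects_summary(effects, variances):
    """Pool per-study effects (e.g. log DOR) under a random effects model.

    Returns Cochran's Q, I^2 (in %), the DerSimonian-Laird tau^2 and the
    pooled estimate. Assumes at least two studies with positive variances.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # fixed-effect (inverse-variance) weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1

    # I^2: share of total variability attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    # DerSimonian-Laird between-study variance, truncated at zero.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random effects weights add tau^2 to every study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return {"Q": q, "I2": i2, "tau2": tau2, "pooled": pooled}


# Hypothetical effects and variances, for illustration only.
res = random_effects_summary([1.0, 2.0, 3.0], [0.1, 0.1, 0.1])
print(res)
```

Under the null hypothesis of homogeneity, Q is compared against a χ² distribution with k − 1 degrees of freedom, so a Q well above k − 1 yields both a small p-value and a high I².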
The values of sensitivity and specificity differed substantially among all studies with SN ranging in [0.

Discussion
The literature search yielded 68 studies concerning the application of biomedical thermal data with ML strategies. More than half of these were focused on breast cancer diagnosis, yet its usage as a primary screening method is not recommended [80].
There is a clear preference for the use of ANN (19 studies), SVM (17 studies) and k-NN (9 studies) classifiers to achieve the best possible classification metrics, with only some authors choosing ensemble approaches (9 studies) (Appendix A). The large number of successful implementation strategies already published, and the availability of pre-written algorithms based on these learners in development and computing environments such as WEKA, Spyder and Matlab, could justify this choice. However, in some cases, this leads to a poor or nonexistent description of classification parameters, training/testing conditions and/or number/type of input features selected, hindering the reproducibility of such studies. The lack of standard classification parameters is also clear and hampers a thorough meta-analysis.
Among the breast cancer studies, maximum performance was achieved by [39], while [53] showed the best approach to aid diabetic foot diagnosis. Still, additional work can be done to improve the current methodologies.
For future work, most authors emphasize the need for improved feature selection strategies to guarantee the inclusion of significant features, while keeping the number of classification inputs as low as possible, thus reducing processing time [9,15,17,18,29,37,39,41,53,63,65]. To improve classification metrics, IR data could be complemented with information collected from other imaging modalities and/or biological tests [9,26,33,64,70]. The availability of a larger data sample is also mentioned by several studies across the different pathologies, in order to perform more complete testing and ease the implementation of such methodologies in daily practice [9,16,25,26,31,44,46,52,53,54,58,60,63,65,67]. Apart from the mentioned suggestions, the implementation of parameter optimization during the construction of the learner may yield better classification results, as may the use of strategies to deal with potential class imbalance problems. In addition, the construction of user interfaces/dashboards designed to be used by health-care professionals would simplify the introduction of ML aiding tools in day-to-day clinical activities.
From the visual inspection of the meta-analysis results, it is possible to conclude that there is some heterogeneity between studies. The log DOR forest plot displays the highest statistical heterogeneity, followed by the sensitivity and specificity ones, with the latter showing a larger number of overlapping confidence intervals. This factor is also reflected in the SROC curve, since slight scattering is visible. This variability may partially be caused by the comparison of studies that implement different learners for the classification of distinct pathologies. Nonetheless, there is a good distribution of studies in the top left corner of the SROC curve, showing a good balance between sensitivity and false positive rate, with the majority of SN and SP values at the highest range of the scale (SN > 0.82, SP > 0.76). Thus, the interest of ML models for biomedical thermal applications is indicated. The low χ² value attained for diabetic foot studies, with a high p-value, reveals that heterogeneity is not significant (Appendix B). This fact, allied to the good sensitivity and specificity of the encountered estimators, leads to the conclusion that the use of biothermal data for diabetic foot assessment with artificial intelligence learners could be an interesting diagnostic aiding tool. Still, this inference should be taken with caution, due to the low number of studies included in the quantitative analysis. The similarity between the meta-analysis results for all studies, and for breast cancer separately, favors the conclusion that thermographic data cannot be used on its own for breast tumor assessment. The heterogeneity visible in the SN, SP and log DOR forest plots, with the summary (DSL) being excluded by the error bars of some studies, also supports this conclusion.
Nonetheless, clinical heterogeneity could also have an impact on the high statistical heterogeneity found, due to differences in population enrolled, disease severity and followed methodology.
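The reading of the SROC plane discussed above can be made concrete with a short sketch. It is only an illustration in Python, with hypothetical function names: each study is placed at (false positive rate, sensitivity), and its Euclidean distance to the ideal top left corner (0, 1) is computed. The input values are the lower bounds quoted in the text (SN > 0.82, SP > 0.76), not the result of any single study.

```python
import math


def roc_point(sn, sp):
    """Map a (SN, SP) pair to ROC coordinates (FPR, TPR)."""
    return (1.0 - sp, sn)


def distance_to_ideal(sn, sp):
    """Euclidean distance from the study's ROC point to the corner (0, 1)."""
    fpr, tpr = roc_point(sn, sp)
    return math.hypot(fpr, 1.0 - tpr)


# Lower bounds quoted in the text, used here only as an example.
fpr, tpr = roc_point(0.82, 0.76)
print(round(fpr, 2), round(tpr, 2))
print(round(distance_to_ideal(0.82, 0.76), 3))
```

Studies above both bounds fall closer to the corner than this example point, which is what the clustering in the top left region of the SROC curve reflects.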

Conclusions
IR thermal imaging is a valuable imaging modality to assess physiological changes with pathological causes, such as the ones encountered during neoplasm development and rheumatic diseases. The use of artificial intelligence models for thermogram interpretation has shown positive results, combining a fast analysis with efficient classification. Still, more work is needed to promote its usage on a daily basis in clinical scenarios. The search for more discriminant features and larger data sets are the most important topics to improve performance metrics. The use of IRT imaging with ML alone for breast cancer assessment should be avoided, and additional meta-analytic studies for diabetic foot diagnosis are of interest to strongly prove the usefulness of these tools. The combined use of IRT imaging and ML methods should also be further extended to other pathological conditions, such as skin cancer, in the hope of easing detection and decreasing treatment costs.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Table A1. Biomedical thermal applications and respective implemented base classifier, sample size, accuracy (ACC), sensitivity (SN) and specificity (SP), chronologically ordered from oldest to latest.
(Table body not recovered; surviving column headers: Year (Ref.), Biomedical Application, Best)

Appendix C
Figure A1. Forest plots for sensitivity for breast cancer studies.