Artificial Intelligence on Diagnostic Aid of Leprosy: A Systematic Literature Review

Leprosy is a neglected tropical disease that can cause physical injury and mental disability. Diagnosis is primarily clinical, but can be inconclusive due to the absence of initial symptoms and similarity to other dermatological diseases. Artificial intelligence (AI) techniques have been used in dermatology, assisting clinical procedures and diagnostics. In particular, AI-supported solutions have been proposed in the literature to aid in the diagnosis of leprosy, and this Systematic Literature Review (SLR) aims to characterize the state of the art. This SLR followed the preferred reporting items for systematic reviews and meta-analyses (PRISMA) framework and was conducted in the following databases: ACM Digital Library, IEEE Digital Library, ISI Web of Science, Scopus, and PubMed. The researchers applied criteria to select the studies, assess their quality, and perform the data extraction process. In total, 1659 studies were retrieved, of which 21 were included in the review after selection. Most of the studies used images of skin lesions, classical machine learning algorithms, and multi-class classification tasks to develop models to diagnose dermatological diseases. Most of the reviewed articles did not target leprosy as the study's primary objective but rather the classification of different skin diseases (among them, leprosy). Although AI-supported leprosy diagnosis is constantly evolving, research in this area is still at an early stage, and further studies are required to make AI solutions mature enough to be translated into clinical practice. Expanding research efforts on leprosy diagnosis, coupled with the advocacy of open science in leveraging AI for diagnostic support, can yield robust and influential outcomes.


Introduction
Neglected tropical diseases (NTDs) can compromise people's quality of life, leading to physical and psychological disabilities. Such diseases are caused by infectious agents or parasites and affect more than one billion people worldwide [1]. They are more prevalent in populations in Latin America, Africa, and Asia [2,3], and considered endemic in 13 low- and middle-income countries [4]. Leprosy is an NTD and considered one of the oldest diseases in human history [5]. The infectious agent of the disease is the intracellular parasite Mycobacterium leprae (M. leprae) or Hansen's bacillus, which may affect the skin, peripheral nerves, eyes, endothelial cells, bones, and mucous membranes, and may result in physical injuries and mental disabilities [6,7]. Approximately 210,000 new cases are reported annually, with 15,000 cases identified in children [8]. Leprosy is present in more than 150 countries, with 80% of cases concentrated in India, Brazil and Indonesia, and is considered a public health problem [1].
Despite its high incidence in some regions, estimates show that only 5% of people exposed to the leprosy pathogen are actually infected, and only 20% of them develop the disease [9]. Even people who do not develop the disease, because they have innate immunity, may experience a period in which the bacillus is released through the upper respiratory tract, which is the most common route of transmission [10,11]. Diagnosing leprosy is a challenging task since the symptoms take from two months to 20 years to appear, and there is no gold standard test to diagnose it [12]. The diagnosis is predominantly made by analyzing clinical and dermato-neurological signs, and complementary tests such as the heat sensitivity test, the Mitsuda intradermal reaction test, and serology may also be used [13]. However, the absence of early symptomatology and similarity to other dermatological conditions can lead to an inconclusive diagnosis and, consequently, lack of appropriate treatment [14].
Artificial intelligence (AI) technology is a growing area and, thanks to machine/deep learning (ML/DL), it has gained increasing prominence in medicine. ML/DL techniques encompass statistical models and algorithms capable of progressively learning from data, predicting features, and executing a task [15]. In particular, DL systems can process complex, high-dimensional data such as images [16,17]. In recent years, ML/DL applications have increased exponentially as a diagnostic aid in dermatology [18,19]. Methods for the analysis and classification of dermatological lesions may involve steps such as image acquisition, pre-processing, segmentation, feature extraction and lesion classification [5,20].
In recent years, secondary studies have addressed the use of AI in the health area, especially in dermatology, given the need for recognition and analysis of images with high speed and accuracy [21]. For example, Brinker et al. [22] reviewed studies focused on the development of skin lesion classifiers using Convolutional Neural Networks (CNNs). Popescu et al. [23] presented the advances in the detection of melanoma using artificial neural networks (ANNs). Wu et al. [24] provided an overview of the algorithms based on DL for skin cancer classification, while Kumar et al. [25] gathered data related to AI techniques for diagnosing various diseases, including skin diseases. Yu et al. [26] summarized a set of ML applications for psoriasis assessment and management. Different from the previous secondary studies, this Systematic Literature Review (SLR) aims to identify, analyze and characterize the state-of-the-art AI techniques for diagnostic aid of leprosy.
The remainder of this article is organized as follows. Section 2 addresses relevant concepts related to leprosy and its diagnosis methods, as well as concepts and applications of AI in clinical medicine, for a better understanding of the review. Section 3 describes the research methodology. In Section 4, the selected studies are detailed to answer the research questions, while we discuss trends, open issues and limitations in Section 5. Finally, Section 6 concludes the review.

Leprosy
Leprosy, one of humanity's oldest diseases, remains a significant public health problem worldwide despite being treatable [7]. This chronic infectious condition, caused by Mycobacterium leprae, can affect the cells lining blood and lymph vessels, sensory, motor and autonomic nerves, eyes, bones, and the upper respiratory tract [27,28]. Exploring the epidemiology of this disease unveils the complex dynamics underlying its occurrence [9]. In addition to the medical complexities, individuals diagnosed with leprosy endure social discrimination, face social exclusion, suffer from a diminished quality of life and often struggle with permanent disfigurement [29].
Regarding transmissibility, even people who do not develop the condition may experience a period in which the bacillus is released through the upper respiratory tract, which is the most common route of transmission [10,11]. This transmission occurs through close and prolonged contact between a susceptible individual and an infected bacillus carrier. In addition, less common transmission can also occur through skin erosion and vertical transmission. Thus, an infected bacillus carrier is essential in transmitting leprosy [30–32]. The infected individual who develops the disease may present characteristic symptoms that can determine the type of classification and, consequently, the treatment.
According to the World Health Organization (WHO), leprosy can be categorized as paucibacillary (PB) or multibacillary (MB), depending on the individual's immune response to M. leprae. This classification, called operational, is designed for treatment purposes and guides the appropriate therapeutic regimen for the patient. It is based on the clinical appearance and bacterial index of the lesions. Individuals who have up to five skin lesions and negative intradermal smears are considered PB, and those with six or more skin lesions and positive intradermal smears are classified as MB [33,34].
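The operational rule stated above can be written as a small decision function. This is only an illustrative sketch: the function name `operational_class` is hypothetical, and it encodes just the two cases given in the text. Combinations the text does not cover (for example, few lesions with a positive smear) return `None` rather than guessing, and the code is not a clinical tool.

```python
from typing import Optional


def operational_class(lesion_count: int, smear_positive: bool) -> Optional[str]:
    """Return 'PB' or 'MB' for the two cases stated in the text, else None."""
    if lesion_count < 0:
        raise ValueError("lesion_count must be non-negative")
    if lesion_count <= 5 and not smear_positive:
        return "PB"  # up to five lesions and negative intradermal smear
    if lesion_count >= 6 and smear_positive:
        return "MB"  # six or more lesions and positive intradermal smear
    return None  # combination not covered by the rule as stated above


print(operational_class(3, False))  # PB
print(operational_class(8, True))   # MB
print(operational_class(3, True))   # None
```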
M. leprae leads to loss of sensation, innervation, damage within the epidermis, and lesions, which are associated with loss of myelin in Schwann cells. In Brazil, one of the most endemic countries for the disease, the Madrid classification (1953) [35] is used, which was later adapted by Ridley and Jopling (1966) and is widely used throughout the world. The Madrid classification also determines the type of leprosy according to the characteristics of the lesions, neural involvement and sensitivity. It subdivides them into indeterminate, tuberculoid, borderline and virchowian leprosy and is widely used for the differential diagnosis of leprosy [36,37]. Ridley and Jopling [38] classified leprosy into five clinical forms based on clinical, histopathological, immunological, and bacilloscopic characteristics: tuberculoid-tuberculoid, borderline-tuberculoid, borderline-borderline, borderline-lepromatous, and lepromatous-lepromatous.

Diagnosis Methods of Leprosy
Approximately 70% of leprosy cases are clinically diagnosed through clinical and epidemiological history, anamnesis, dermatological and neurological evaluation [39]. Clinical diagnosis is based on three cardinal signs: (1) definite loss of sensation in a hypopigmented or reddened skin patch; (2) peripheral nerve thickening, with loss of sensation and weakness of the muscles innervated by the affected nerve; and (3) microscopic detection of bacilli [1,40,41]. However, 30% of patients do not have the typical characteristics of the disease, requiring additional tests, such as the Mitsuda reaction test, serological tests, and molecular biology tests [37,42]. Specifically, in the clinical diagnosis performed by health professionals, the patient's dermatological and neurological signs and symptoms are evaluated. Skin lesions are identified when present, and a thermal sensitivity test is performed to assess sensitivity changes in the lesions [27,37].
The neurological field also needs to be assessed. Tests for analysis of irritation or itching in the eyes and bleeding or wounds in the nose, palpation of peripheral nerve trunks, and assessment of muscle strength and joint mobility in hands and feet are necessary to verify neural involvement [39,43,44]. Hand and foot mobility may be assessed by the Graded Sensory Test [45]. Neuropathic pain affects more than 60% of leprosy patients and is caused by primary damage to fine fibers, unmyelinated fibers or dysfunction of the nervous system, and can be assessed by electroneuromyography [46].
The clinical diagnosis is often insufficient, requiring additional tests, such as laboratory tests [37,41]. Bacilloscopy, for example, allows the detection of alcohol-acid-resistant bacilli, such as Hansen's bacillus, and has a specificity of 100% and a sensitivity that varies between 34.4% and 50%. For its application, it is necessary to collect smears of intradermal scrapings from regions such as right and left earlobes, right and left elbows, and skin lesions [47,48]. Another critical test is the histopathological one, which is performed using samples from the edges of more active lesions and nerves that can help diagnose cases with atypical clinical manifestations and then direct a more accurate treatment. The specificity of this exam is from 70% to 72%; on the other hand, its sensitivity is low, ranging from 49% to 70% [33,48,49].
The Mitsuda intradermal reaction test is a skin reaction based on an individual response of high sensitivity and high specificity of the delayed cell type against the bacillus M. leprae. Fernández's intradermal reaction test, on the other hand, has an early reaction with low sensitivity, thus presenting a risk of cross-reactivity with other bacteria. Therefore, the Mitsuda test is the most widely used [37]. A positive result indicates that an individual's macrophages can destroy the bacillus M. leprae; a negative result indicates that, if exposed to M. leprae, the individual is at greater risk of becoming ill and developing the virchowian form of leprosy. In addition, the test helps to classify leprosy as undefined and borderline [50].
Another widely used test is serological, which is important for evaluating and quantifying the bacterial load of M. leprae. Phenolic glycolipid-I (PGL-I) is the major antigenic glycolipid of M. leprae and allows the detection of anti-PGL-I immunoglobulin G (IgG) and immunoglobulin M (IgM) antibodies. The presence of IgM antibodies in response to PGL-I, which is present in the cell wall of M. leprae, helps to classify leprosy (low bacterial load for PB and high bacterial load for MB). Among the numerous methodologies, two are widely used: the enzyme-linked immunosorbent assay (ELISA) and ML-Flow, an alternative method to ELISA but with a lateral flow format [36]. In addition, serology is essential to identify household contacts at higher risk of developing the disease, as well as in the follow-up of cases to assess the risk of relapse [36,51].
Among the most specific diagnostic methods are molecular tests, which target species-specific deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) sequences, such as the polymerase chain reaction (PCR) [47,52]. This methodology presents high specificity and sensitivity due to its success in detecting Hansen's bacillus DNA, even when few bacilli are present [53–55].

Artificial Intelligence in Clinical Medicine
AI refers to the ability of intelligent agents to learn and solve problems in automated processes that impact the quality of life in society, from the automation of industrial processes and communication by smartphones to diagnostic help in medicine [21]. Machine Learning (ML) is a subfield of AI, which uses autonomous algorithms and statistical models that identify patterns and have the potential to help diagnose diseases and other medical approaches [56]. Deep learning (DL) is a sub-area of ML and uses the concept of artificial neuron layers for pattern extraction, and representation of complex and unstructured data [57,58].
Computer vision (CV), combined with AI models, enables the analysis of medical images, assisting in the diagnostic process and potentially reducing human errors [59]. Some diagnostic models based on CV have shown significant evidence in improving the detection of diseases, such as in the early diagnosis aid of skin cancer [60].
In healthcare, AI can provide suggestions and recommendations that direct the decision-making process in clinical practice, facilitated by evaluation and testing, notwithstanding barriers, such as data availability and quality [61–63]. Disease prediction models use AI techniques (i.e., ML/DL) associated with data mining approaches [64]. Cardiology, pulmonary medicine, endocrinology, nephrology, gastroenterology, and neurology are some application areas of AI in the medical practice [65,66].
The diagnostic method involves detecting a disease or health condition by analyzing the individual's clinical signs and symptoms. AI models have been used to facilitate the diagnostic process, with the development of methods capable of analyzing, classifying, and predicting an outcome using a dataset related to the various existing pathologies, such as cancer, diabetes, dengue, malaria, tuberculosis [17], and mental health [67,68].
Moreover, improvements in the health area from AI have also reached the dermatological field [69]. ML techniques are helpful in diagnosis through skin image analysis, and personalized treatment is a future trend in this area [21,70,71]. Advances are not restricted to the analysis of melanomas and pigmented skin lesions. Other dermatological conditions are analyzed, such as psoriasis, acne, autoimmune disorders, and allergic contact dermatitis [20]. Promising results have been demonstrated in the detection of monkeypox lesions using the MobileNetV2 architecture [72] and in the early detection of skin cancer [15], in which CNNs show high accuracy in disease recognition [73].

Methodology
This SLR followed the preferred reporting items for systematic reviews and meta-analyses (PRISMA) framework [74] (see PRISMA checklist in Supplementary File S1). We addressed three distinct phases: (i) planning the conduction of the review by elaborating a review protocol; (ii) collaboratively performing the activities contained in the protocol using the online tool Parsif.al [75]; and (iii) extracting data from the selected articles, analyzing and synthesizing the relevant information on the research topic. We registered the review protocol on PROSPERO (registration number CRD42023400323).

Research Questions
The following Research Questions (RQs) were defined:

Search Strategy and Selection Criteria
The following digital libraries were used to search for primary studies: ACM Digital Library, IEEE Digital Library, Web of Science, Scopus, and PubMed. The search process, conducted by one researcher, occurred on 31 August 2023; it was verified by two other researchers. To retrieve relevant studies, we used two search terms combined with their synonyms to design the search string presented in Box 1. The following three control articles were selected to guide the searches in the digital libraries [12,14,76]. The validation process of the search strings occurred in the databases, which demonstrated their ability to find studies suitable for this SLR, including the control articles.

Box 1. Search string.
(("Artificial Intelligence" OR "Data Science" OR "Deep Learning" OR "Machine Learning" OR "Algorithm*" OR "Predict* Model*" OR "Big Data" OR "Transfer Learning" OR "Computer Vision" OR "Text Mining" OR "Dataset" OR "Support Vector Machine" OR "Artificial Neural Network" OR "Backpropagation Neural Network" OR "Convolutional Neural Network" OR "Neural network" OR "Pattern recognition" OR "Supervised Learning" OR "Generative Adversarial Network" OR "Feature Learning" OR "Meta Data" OR "Image Segmentation" OR "Image Classifiers" OR "Image Processing" OR "Fuzzy Logic" OR "Decision Tree" OR "Decision Support System" OR "Support Vector Regression" OR "Regression" OR "Bayesian" OR "K-nearest Neighbors" OR "K-means") AND ("Leprosy" OR "Hansen's Disease"))
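The structure of the search string in Box 1 (synonyms joined with OR inside each group, the two groups joined with AND) can be reproduced programmatically, which may help when adapting the query to another database. The sketch below is an assumption-laden illustration: `build_query` is a hypothetical helper, and only a small subset of the Box 1 terms is used.

```python
def build_query(groups):
    """Join synonyms with OR inside each group, then join the groups with AND."""
    clauses = []
    for terms in groups:
        quoted = " OR ".join(f'"{term}"' for term in terms)
        clauses.append(f"({quoted})")
    return "(" + " AND ".join(clauses) + ")"


# Subset of the Box 1 terms, chosen only for illustration.
ai_terms = ["Artificial Intelligence", "Machine Learning", "Deep Learning"]
disease_terms = ["Leprosy", "Hansen's Disease"]

query = build_query([ai_terms, disease_terms])
print(query)
```

Individual digital libraries may require their own field tags or wildcard syntax, so the generated string would still need per-database adjustment.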
Inclusion and exclusion criteria were defined for selecting articles, as listed in Table 1. Initially, we retrieved documents and compared them to remove duplicate records. We screened articles for eligibility based on their title, abstract, and keywords. In a second step, the researchers read and analyzed the full text of the screened studies to identify those suitable for the scope of this review. We then evaluated the selection process by applying Cohen's Kappa coefficient [77], which measures the level of agreement between researchers' analyses. When there was no consensus among the researchers, the other researchers (co-authors) held discussions to resolve selection conflicts. Finally, we performed the snowballing technique [78,79] to maximize results in the selection process.

Table 1. Inclusion criteria (IC) and exclusion criteria (EC).
(IC4) Peer-reviewed articles.
(EC4) Articles in a language other than English.
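Cohen's Kappa, used above to evaluate the selection process, compares the observed agreement between two raters with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following minimal sketch computes it from two lists of decisions. The function name and the include/exclude decisions are invented for illustration and are not the review's actual screening data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # assumption: both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical screening decisions for ten articles.
a = ["include"] * 4 + ["exclude"] * 6
b = ["include"] * 3 + ["exclude"] * 7
print(round(cohen_kappa(a, b), 2))  # 0.78
```

In this toy case the raters agree on 9 of 10 items (p_o = 0.90) while chance agreement is p_e = 0.54, giving a kappa of about 0.78.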

Quality Assessment
Two independent researchers evaluated the selected studies using a quality assessment tool adapted from Cabitza and Campagner [80] to qualitatively evaluate the robustness of the methodology in medical machine learning studies and the ability to reproduce their findings. The checklist contains 30 items, which are quality criteria (QC), organized into six phases: problem understanding, data understanding, data preparation, modeling, validation, and deployment. Each item represents a requirement and is associated with three possible options: adequately addressed (OK); sufficient but improvable, minor revision required (mR); and inadequately addressed, major revision required (MR). The studies were individually classified on a trichotomous scale associated with the tool's options, with a score of 1 (OK), 0.5 (mR), and 0 (MR). The quality assessment score is calculated as the sum of the scores assigned to the items. The two researchers who analyzed the studies to assign the scores resolved evaluation conflicts through discussions. When there was no agreement, a third researcher acted as a judge and resolved the conflicts. The QCs used are shown below in Table 2.

Table 2. Quality criteria (QC) by checklist phase.

Problem Understanding
(QC1) Is the study population described, also in terms of inclusion/exclusion criteria?
(QC2) Is the study design described?
(QC3) Is the study setting described?
(QC4) Is the source of data described?
(QC5) Is the medical task reported?
(QC6) Is the data collection process described, also in terms of setting-specific data collection strategies?

Data Understanding
(QC7) Are the subject demographics described in terms of average age, age variability, gender breakdown, main comorbidities, ethnic group, socioeconomic status?
(QC8) If the task is supervised, is the gold standard described?
(QC9) In the case of tabular data, are the features described?

Data Preparation
(QC10) Is outlier detection and analysis performed and reported?
(QC11) Is missing-value management described?
(QC12) Is feature pre-processing performed and described?
(QC13) Is data imbalance analysis and adjustment performed and reported?

(QC30) Is the system already adopted in daily practice?
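The scoring scheme described for the checklist (1 for OK, 0.5 for mR, 0 for MR, summed over the 30 items) can be sketched as follows. The item ratings below are hypothetical and do not correspond to any reviewed study. They were chosen so that the total is 19 of 30 points, which equals 63.3% of the maximum score.

```python
SCORES = {"OK": 1.0, "mR": 0.5, "MR": 0.0}


def quality_score(ratings):
    """Return (total score, percentage of the maximum) for per-item ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    total = sum(SCORES[rating] for rating in ratings)
    return total, 100.0 * total / len(ratings)


# Hypothetical ratings for one study across the 30 checklist items.
ratings = ["OK"] * 15 + ["mR"] * 8 + ["MR"] * 7
total, percent = quality_score(ratings)
print(total, round(percent, 1))  # 19.0 63.3
```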

Data Extraction
The researchers read each selected article to extract the necessary information to answer the research questions, characterize the studies, and outline opportunities for future work. Table 3 presents the items of the data extraction form and their respective research question.

Study Selection
Figure 1 presents the PRISMA flow diagram with the study selection process. We initially retrieved 1659 articles from digital libraries. We then identified and removed 355 duplicate studies. Among the remaining 1304 articles, 51 were selected by reading the title and abstract. Eighteen papers were eligible after a complete reading of the study. Cohen's Kappa between the researchers' analyses was 0.84 (p < 0.001), considered "almost perfect agreement" [81]. We resolved conflicts through discussions, resulting in 18 studies. The snowballing technique resulted in the addition of three studies meeting the selection criteria. A total of 21 studies were included in the review for qualitative analysis.
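The counts reported for the selection process can be checked with simple arithmetic. The short sketch below recomputes the intermediate totals from the figures in the text. The variable names are illustrative, and the two exclusion counts it prints are derived by subtraction rather than reported in the text.

```python
retrieved = 1659
duplicates = 355
screened = retrieved - duplicates        # articles screened by title and abstract
selected_by_title_abstract = 51
eligible_full_text = 18
added_by_snowballing = 3
included = eligible_full_text + added_by_snowballing

assert screened == 1304
assert included == 21

excluded_at_screening = screened - selected_by_title_abstract
excluded_at_full_text = selected_by_title_abstract - eligible_full_text
print(excluded_at_screening)  # 1253
print(excluded_at_full_text)  # 33
```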

Study Characterization
Table 4 presents the data extracted from the selected articles to answer the RQs. No study classified the types of leprosy according to the Madrid classification or the Ridley and Jopling classification. Some papers classified leprosy according to the operational classification, which is recommended by the WHO: the studies in [14,87,94] classified leprosy as either paucibacillary or multibacillary. Binary classification occurred in 43% of the selected articles [12,14,76,82,83,87,94–96]. The remaining 57% of the articles used multiclass classification tasks, in which the models classified different skin diseases and, among them, leprosy [84–86,88–93,97–99]. Also, 24% of the studies classified cases as leprosy or not leprosy. Most papers did not prioritize leprosy when proposing an AI model to aid in diagnosing the condition; instead, the works are directed at classifying skin diseases, including leprosy.
The studies [14,76,87,94,96] used the outcomes of tests as input data to AI models, such as the RNA sequencing technique (RNA-Seq) and real-time reverse transcription polymerase chain reaction (RT-qPCR) used in molecular and cellular biology. The RNA-Seq technique extracts total RNA from a biological sample, converts it into complementary DNA (cDNA), and performs next-generation sequencing (NGS) [100]. The RT-qPCR and RNA-Seq techniques are used to quantify gene expression [101]. The study by Tió-Coma et al. [76] used the results of gene expression analyses from RNA-Seq and RT-qPCR as input to develop an AI model. The study by Pillai and Chouhan [96] analyzed the H37Rv strain to study the immunology and pathogenesis of tuberculosis. H37Rv is a strain of Mycobacterium tuberculosis [102] and shares characteristics similar to M. leprae.
To create an AI model, the study by Gama et al. [94] used age, sex, treatment time, qPCR test result (M. leprae DNA level), IgG/IgM serology level, and sputum smear index as input data. The IgG/IgM serology levels reflect the amount of IgG/IgM antibodies present in a person's blood sample and indicate whether the person has been exposed to a pathogen, such as a virus or bacterium [103]. The sputum smear index is an indicator to assess the bacillary load of a Mycobacterium in a sputum sample [104].
Cytokines are a group of signaling molecules produced by the immune system, such as tumor necrosis factor (TNF), interferon-gamma (IFN-γ), interleukin 4 (IL-4), and interleukin 10 (IL-10), and their presence can indicate a specific disease [105]. Marçal et al. [14] used results from an in vitro assay model of the M. leprae antigen and measurements of the cytokines TNF, IFN-γ, IL-4, and IL-10 as input to develop an AI model for the operational classification of leprosy.

Preprocessing Techniques (RQ3)
Preprocessing techniques to prepare the dataset may depend on the choice of data type and the algorithm or architecture used to train the model. Some authors used numerical data, which required the data to be normalized [12]. Researchers who used image datasets and classical ML algorithms [86,89,92,93,97–99] had to apply various image preparation and feature extraction techniques. The features most explored by the authors were related to texture and edges, with applications of spatial filters aiming to correct, smooth, or enhance specific regions. In addition, image compression techniques (e.g., YCbCr algorithm [97] and DCT [89,98]), segmentation-related techniques (e.g., binary mask [92,97], histogram [90,97], Otsu [86], global thresholding [93]), and image noise reduction techniques (e.g., median filter, smooth filter [92,97]) were explored by the studies. Jin et al. [90] used the ResNet-50 and VGG16 architectures and the HOG technique for feature extraction. Mondal et al. [85,91] used techniques for image normalization and augmentation; the studies in [83,84] used data augmentation.
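To make one of the segmentation techniques named above concrete, the following is a minimal pure-Python sketch of Otsu thresholding, which selects the gray level that maximizes the between-class variance of the image histogram. It is a generic textbook formulation, not the implementation used in [86]. The function name and the eight-pixel toy "image" are assumptions made for illustration.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form one class, the remaining pixels the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(levels):
        w_bg += hist[t]                 # pixels at or below t
        if w_bg == 0:
            continue
        w_fg = total - w_bg             # pixels above t
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


# Toy grayscale values: four dark and four bright pixels.
pixels = [10, 12, 11, 13, 200, 205, 198, 202]
t = otsu_threshold(pixels)
mask = [p > t for p in pixels]  # binary mask separating bright from dark pixels
print(t)     # 13
print(mask)  # [False, False, False, False, True, True, True, True]
```

In practice a library routine operating on image arrays would normally be preferred; the loop form is shown only to expose the computation.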

Performance of the Models (RQ5)
The most common metrics used to measure the performance of AI models in the selected papers were accuracy, precision, sensitivity/recall, specificity, F1 score, and AUC. The review revealed that accuracy was used by the authors in 90% of the studies. In addition, accuracy was the only metric used to measure model performance in 52% of the articles [14,82,85,89,90,93,95–99]. Accuracy can be misleading in multiclass classification tasks with imbalanced datasets, in which one class may have more samples than others. In such cases, other metrics, such as precision, recall, and F1 score, should be used together with accuracy to give a clearer picture of performance [108].
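The point about imbalanced data can be illustrated with a small calculation. The sketch below computes accuracy, precision, recall, and F1 score from confusion-matrix counts for a single class (in a multiclass task, the same computation would be applied per class in a one-vs-rest fashion). The counts are synthetic and the function name is an assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Synthetic imbalanced example: 5 positive cases among 100 samples.
# The model detects only 1 of the 5 and raises no false alarms.
m = binary_metrics(tp=1, fp=0, fn=4, tn=95)
print(m["accuracy"])      # 0.96
print(m["recall"])        # 0.2
print(round(m["f1"], 2))  # 0.33
```

Here the accuracy of 0.96 hides the fact that four of the five positive cases were missed, which the recall of 0.2 and the F1 score of about 0.33 make visible.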
Figure 2 shows the metrics, the data types, the algorithms/architectures, and the performance of the models developed by the selected studies. A study implementing an architecture called LeprosyNet for a binary classification task obtained the best performance in terms of accuracy. The most used algorithm among the selected studies was the SVM. In addition, the review revealed that the best-performing DL techniques for multiclass classification of leprosy against other dermatoses were the CNN architectures MobileNet-V2 and DenseNet-121. The models developed from image datasets had an average accuracy in the classification of leprosy of 89.97%, whereas the models created from numerical datasets reached an average accuracy of 87.98%.

Study Quality
Evaluating the quality of the selected studies allowed us to qualitatively analyze the methodological rigor of the ML studies in the medical area, their contributions, and the reproducibility of the results (see detailed quality assessment per study in Supplementary File S2). In 11 articles [83,86,90,92–99], the scores reached less than 40% of the maximum tool score. Other studies attained results above 40% of the quality criteria [76,82,87], and the study in [12] achieved 63.3%, which was the best evaluation. Figure 3 depicts an overview of the quality assessment result per phase of the checklist.
The modeling phase was adequately addressed by all studies evaluated. In the problem-understanding phase, the studies in [12,14,76,82,85,87–90,94,97] demonstrated satisfactory quality. In contrast, the remaining studies could have provided more robust information in the problem-understanding phase. In the phases of data understanding, data preparation, validation and deployment, the studies failed to address the quality criteria assessed.

Discussion
To the best of our knowledge, this is the first SLR to focus on leprosy diagnostic aid supported by AI techniques. Therefore, this work can help researchers in AI and health informatics by characterizing the studies regarding datasets, preprocessing techniques, and AI algorithms/architectures, and by comparing the performance of different ML/DL models. In this section, we identify trends and open issues in current research, which are opportunities for future research. We also acknowledge the limitations of this SLR.

Trends
We recognized several trends in the studies on AI-supported diagnostic aid of leprosy. First, we identified a trend of using images in datasets (n = 16) for developing AI models for leprosy classification. We also recognized a trend to use classical supervised ML algorithms (n = 14), highlighting SVM (n = 8), RF (n = 5) and DT (n = 4) as the most used ones. Finally, most models were developed for multiclass classification tasks (n = 12), and the most used metric was accuracy (n = 19).
Figure 4 displays three bar charts with the number of papers published by year, categorized by the type of model (i.e., classical machine learning vs. deep learning), the type of task modeled (i.e., multiclass classification task vs. binary classification task), and the type of dataset (image data vs. numerical data).

Open Issues
Studies identified in our SLR present promising solutions for diagnostic aid of leprosy using AI techniques. However, we recognize that there are open issues to be addressed by further research.

Open Science
Open science promotes openness and accessibility of research results, including data and methods. Reproducibility is an essential component of open science, as it aims to ensure that the results of a study can be reproduced and validated by other researchers [80]. Open science adopts and promotes the Findable, Accessible, Interoperable, Reusable (FAIR) principles, which are guidelines that aim to make scientific data auditable [109]. In this regard, accountability in AI governance ensures that research is conducted ethically, transparently, and responsibly [110].
In most studies (n = 17) identified in our SLR, the dataset, code, and methods used to implement AI models were not shared in public repositories, which impacted the quality assessment of the studies (see Section 4.4). This open issue can (and should) be addressed, for example by implementing AI models that aid in leprosy diagnosis following the principles of open science and by sharing the work in a public repository under a permissive license so that it can undergo external validation. This is enabled by free online repository services, such as GitHub and Zenodo.

Data Fusion
Data fusion in AI refers to combining and integrating information from various data sources that may include different types of data, such as text, images, audio, and databases [111]. Data fusion aims to leverage the complementary information from each data source to improve the accuracy, reliability, and understanding of the results obtained. The fusion process can involve data integration, alignment, aggregation techniques, and the application of ML algorithms to explore and extract knowledge from the combined data [112].
Barbiere et al. [12] combined skin lesion images with clinical data from leprosy patients to train disease classification models. An open issue is an in-depth exploration of how multimodal data originating from different sources (e.g., personal information, clinical signs and symptoms, skin lesion images, and information on reactions to polychemotherapy) can be combined to implement AI models that contribute to the diagnosis of leprosy.
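One simple way to combine modalities such as those listed above is feature-level (early) fusion: records from each source are aligned by patient and their feature vectors are concatenated before a classifier is trained. The sketch below shows only this alignment and concatenation step. It is not the method of [12], and the patient identifiers, feature values, and function name are all hypothetical.

```python
def early_fusion(*sources):
    """Concatenate the feature vectors of patients present in every source."""
    if not sources:
        raise ValueError("at least one data source is required")
    shared_ids = set(sources[0])
    for source in sources[1:]:
        shared_ids &= set(source)  # alignment: keep patients found in all sources
    return {
        pid: [value for source in sources for value in source[pid]]
        for pid in sorted(shared_ids)
    }


# Hypothetical per-patient features from two modalities.
image_features = {"p01": [0.12, 0.80], "p02": [0.55, 0.31]}   # e.g. texture descriptors
clinical_features = {"p01": [1.0, 34.0], "p03": [0.0, 51.0]}  # e.g. symptom flag, age

fused = early_fusion(image_features, clinical_features)
print(fused)  # {'p01': [0.12, 0.8, 1.0, 34.0]}
```

Only patient `p01` appears in both sources, so it is the only record in the fused set. Handling patients with missing modalities is a design decision this sketch deliberately leaves out.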

Differential Diagnosis
Several diseases have skin lesions with characteristics similar to leprosy, which can significantly increase the rate of false diagnosis and, in turn, the stigma associated with the disease [113]. The results of this SLR revealed that the proposed AI models framed leprosy classification in different ways (e.g., paucibacillary vs. multibacillary; binary classification to identify the presence of leprosy; and leprosy classified against other skin diseases). Yet, a research gap to be addressed is building AI models that use image analysis to classify the skin lesions caused by leprosy according to its clinical forms (e.g., Madrid or Ridley and Jopling classifications), thereby facilitating the differential diagnosis, that is, the distinction between leprosy and other dermatological conditions that may present similar symptoms.

External Validation
External validation refers to testing a model's ability to make accurate and useful predictions on datasets not used during training, thereby providing evidence of the model's generalizability. External validation can improve the reliability of models, allowing them to be applied safely in different populations and clinical environments [114][115][116]. None of the studies identified in our review externally validated their developed models (see detailed quality assessment per study in Supplementary File S2). Therefore, this remains an open issue that needs attention in future studies to ensure the prediction models' reliability and utility.
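The distinction between internal and external evaluation can be sketched as follows. The example applies the same fixed model to a held-out internal set and to a set from a different (hypothetical) site, and reports sensitivity and specificity for each. The threshold rule, the scores, and the labels are toy assumptions and are not drawn from any reviewed study.

```python
from typing import Callable, Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int],
                            y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels, where 1 = leprosy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


def evaluate(model: Callable[[float], int],
             samples: Sequence[float],
             labels: Sequence[int]) -> Tuple[float, float]:
    """Apply an already-trained model, without refitting, and score it."""
    predictions = [model(x) for x in samples]
    return sensitivity_specificity(labels, predictions)


def toy_model(score: float) -> int:
    """Stand-in for a trained classifier: a fixed threshold on one score."""
    return 1 if score >= 0.5 else 0


# Internal test set: same source as the training data (toy values).
internal_result = evaluate(toy_model, [0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])

# External set: a different, hypothetical site with shifted scores (toy values).
external_result = evaluate(toy_model, [0.7, 0.4, 0.6, 0.3], [1, 1, 0, 0])
```

The key design point is that the model is not retrained or re-tuned on the external data; a drop in performance between the two results is the kind of evidence about generalizability that external validation is meant to surface.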

Limitations of the SLR
This SLR has limitations to be acknowledged and considered when conducting future research. First, we did not review gray literature, so we did not include articles such as research reports, theses, dissertations, government reports, and tutorials. Consequently, future work may extend this SLR to consider gray literature on this research topic. Second, we searched for articles only in the leading digital libraries. Therefore, future work may also extend the search to additional databases.

Conclusions
The results of this SLR provided new insights into the literature related to AI techniques in aiding leprosy diagnosis. Key trends were identified, such as the prevalence of classical supervised ML algorithms, the development of most models for multiclass classification tasks, and the use of dermatological images as a non-invasive technique. Most of the articles did not consider leprosy as the study's primary objective but rather the classification of different skin diseases and, among them, leprosy. In addition, most of the selected papers did not adhere to the open science principles, showing low quality regarding transparency, data sharing, and responsibility. Such findings highlight the need for more research on leprosy diagnosis and to promote open science in the application of

Box 1. Search string used for this SLR.

(QC19) (Classification models) Is the model calibration described?
(QC20) Is the internal/internal-external model validation procedure described (e.g., internal 10-fold CV, time-based cross-validation)?
(QC21) Has the model been externally validated?
(QC22) Are the main error-based metrics used?
(QC23) Are some relevant errors described?
Deployment
(QC24) Is the target user indicated?
(QC25) (Classification models) Is the utility of the model discussed?
(QC26) Is information regarding model interpretability and explainability available?
(QC27) Is there any discussion regarding model fairness, ethical concerns, or bias risks (for a list of clinically relevant biases, refer to)?
(QC28) Is any point made about the environmental sustainability of the model, the carbon footprint, of either the training phase or inference phase (use) of the model?
(QC29) Is code and data shared with the community?

Figure 3. Quality assessment results for the six phases of the checklist [80].

Figure 4. Number of articles published by year, organized according to different characteristics.

Table 4. Data extracted from the selected articles.