Artificial Intelligence in Periodontology: A Scoping Review

Artificial intelligence (AI) is the development of computer systems whereby machines can mimic human actions. This is increasingly used as an assistive tool to help clinicians diagnose and treat diseases. Periodontitis is one of the most common diseases worldwide, causing the destruction and loss of the supporting tissues of the teeth. This study aims to assess current literature describing the effect AI has on the diagnosis and epidemiology of this disease. Extensive searches were performed in April 2022, including studies where AI was employed as the independent variable in the assessment, diagnosis, or treatment of patients with periodontitis. A total of 401 articles were identified for abstract screening after duplicates were removed. In total, 293 texts were excluded, leaving 108 for full-text assessment with 50 included for final synthesis. A broad selection of articles was included, with the majority using visual imaging as the input data field, where the mean number of utilised images was 1666 (median 499). There has been a marked increase in the number of studies published in this field over the last decade. However, reporting outcomes remains heterogeneous because of the variety of statistical tests available for analysis. Efforts should be made to standardise methodologies and reporting in order to ensure that meaningful comparisons can be drawn.


Introduction
Artificial intelligence (AI) aims to develop computer systems in which machines can mimic human behaviour. Within medicine and dentistry, commentators predicted as early as the 1970s that AI would bring clinical careers to an end [1]; however, this has not been the case. Science fiction often presents AI as a comprehensive, overarching intelligence [2], but this is far from the truth. Thus far, AI development has proved successful in solving problems in specific areas by learning distinct thinking mechanisms and perceptions.
Periodontitis is the sixth most prevalent disease worldwide. It is characterised by microbially associated, host-mediated inflammation that results in loss of alveolar bone and periodontal attachment, which can lead to tooth loss [3]. This disease has a well-reported but complex relationship with a number of other physiological systems, leading to detrimental effects on quality of life and general health [4]. Further to this, a bi-directional relationship has been shown between periodontitis and systemic conditions, including chronic inflammatory diseases such as diabetes [5,6] and atherosclerosis [7].
Periodontitis is also challenging for clinicians to accurately recognise and diagnose [8]. Best practice currently focuses on measuring soft tissues with a graduated probe [9] and assessing hard tissues with radiographic imaging [10]. However, these methods have poor inter- and intra-operator reliability due to variations in probing pressure and radiographic angulation [8].
As such, the study of periodontitis presents a diagnostic challenge linked to a disease process of complex relationships between predisposing factors that are difficult for clinicians and scientific processes to fully comprehend. These complex factors lend the study of this disease to the application of AI to best comprehend how these factors affect the diagnostics or the understanding of its aetiology.
It is important to differentiate AI from traditional software development. In the traditional approach to software development, the researchers identify a series of processing steps and, optionally, a data-dependent strategy to reach the results. This is best described as follows: an input 'A' is received, it is processed through the pre-defined strategy of sub-tasks, and an output 'B' is returned. As such, whilst this performs incredibly useful tasks for humankind, it requires vast amounts of effort to perform complex tasks and offers only limited adaptability to unseen scenarios. Artificial intelligence, in contrast, has a different mode of work. When developing an AI-based tool, both the input 'A' and the required output 'B' are provided; the AI approach will then tune the tool to leverage the link between inputs and outputs, which can then be used on new (unseen) data sets, typically with remarkable performance [11].
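To make this distinction concrete, the following is a minimal, hypothetical sketch in Python (not drawn from any of the reviewed studies): a traditional rule encodes the decision steps explicitly, whereas a learned model is fitted on example input/output pairs ('A' and 'B') and then applied to unseen inputs. The feature, threshold, and labels are invented for illustration only.

```python
# Hypothetical illustration: a hand-written rule vs. a model fitted on (A, B) pairs.
from sklearn.linear_model import LogisticRegression

def rule_based(bone_loss_percent):
    # Traditional software: the developer pre-defines the decision strategy.
    return 1 if bone_loss_percent > 15 else 0  # 1 = disease, 0 = healthy (invented threshold)

# AI/ML: example inputs 'A' and outputs 'B' are provided, and the mapping is learned.
A = [[5], [10], [20], [40], [8], [30]]   # invented input features
B = [0, 0, 1, 1, 0, 1]                   # invented labels
model = LogisticRegression().fit(A, B)

print(rule_based(25), model.predict([[25]])[0])  # both map a new input to an output
```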
The use of AI is ever-increasing in medicine and dentistry as an assistive tool, becoming a central component of providing safe and effective healthcare. More recently, deep learning (DL) has been the mainstay of this endeavour, mainly through applications stemming from the use of artificial neural networks (ANN) that exhibit a very high degree of complexity [12], where large numbers of artificial neurons (or nodes) are connected into layers and several hundreds, or thousands, of layers are assembled into specific structures called architectures. DL networks can assess large volumes of data to perform specific tasks, among which electronic health records, imaging data, wearable-device sensor collections, and DNA sequencing play a prominent role. Within medical fields, these are classically used for computer-aided diagnosis, personalised treatments, genomic analysis, and treatment response assessment.
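As a rough illustration of the layered structure described above, the sketch below assembles a small fully connected network in PyTorch; the layer sizes and the single-score output are assumptions for illustration, not a model taken from the included studies.

```python
# Minimal sketch of an artificial neural network: nodes arranged in layers,
# each layer feeding the next. Sizes are illustrative only.
import torch
import torch.nn as nn

ann = nn.Sequential(
    nn.Linear(10, 64),  # 10 input features -> 64 hidden nodes
    nn.ReLU(),
    nn.Linear(64, 64),  # second hidden layer
    nn.ReLU(),
    nn.Linear(64, 1),   # single output, e.g., a risk score
)
print(ann(torch.randn(1, 10)))  # forward pass on one dummy record
```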
When images are used as input data, we as humans perceive digital images as analogue, a continual flow of information. Still, a digital (planar) image is nothing more than a collection of millions of tiny points of colour, or pixels, each with its 2D location. Thus, these pixel series can be viewed as strings of values with additional information about their neighbouring locations, which software can process efficiently.
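A short sketch of this representation (using NumPy, with an arbitrary image size chosen for illustration) follows:

```python
# A digital image is simply an array of pixel values, each with a 2D location.
import numpy as np

image = np.zeros((512, 512, 3), dtype=np.uint8)  # height x width x RGB channels
image[100, 200] = [255, 0, 0]                    # pixel at row 100, column 200 set to red
print(image.shape, image[100, 200])              # (512, 512, 3) [255 0 0]
```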
A subset of ANNs, a convolutional neural network (CNN), is specifically designed for handling imaging data. The CNN concept was developed to replicate the visual cortex and differentiate patterns in an image [13]. Classic neural networks typically need to consider each pixel individually to process an image and therefore are heavily constrained in the size of the images that can be analysed. CNNs, on the other hand, are capable of working with the image data in their spatial layout; their output is a new set of data replicating the original layout of the image while increasing or condensing the information stored at each location. This process is similar to applying several digital filters to an image to 'highlight' key features that will collectively help perform the task at hand, e.g., select distinct aspects of an object to identify its presence inside the image.
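The 'digital filter' analogy can be sketched with a single hand-crafted kernel; in a real CNN the kernel values are learned rather than fixed, so the example below (a Laplacian edge filter applied with SciPy) is illustrative only.

```python
# One convolutional filter: a small kernel slides across the image and
# produces a new map in the same spatial layout, here highlighting edges.
import numpy as np
from scipy.ndimage import convolve

image = np.random.rand(64, 64)                 # dummy grey-scale image
kernel = np.array([[0,  1, 0],
                   [1, -4, 1],
                   [0,  1, 0]])                # Laplacian (edge-enhancing) kernel
feature_map = convolve(image, kernel)          # same 64 x 64 layout, new values
print(feature_map.shape)
```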
CNN-based architectures will often have multiple layers, or multiple levels at which these transformations are applied. Early layers will focus on picking up gross content such as edges, gradient orientation, and colour, with later layers focusing on higher-level (more task-specific) features. This kind of approach is usually called an encoder because the iconic information inside the image is transformed into a more abstract, symbolic representation; this is achieved by juxtaposing CNN-based blocks that progressively reduce the size of the image being processed while, concurrently, increasing the number of channels, i.e., the number of values associated with each image pixel. The complementary approach to an encoder is a decoder, where the abstract information is transformed into an iconic representation by successively increasing the image size while reducing the channel number. A common pattern in ANN architectures based on CNN layers is to have an encoder section, a decoder section, or both; for instance, the U-Net architecture [14], which is one of the most used approaches when segmenting imaging data, is structured as an encoder section followed by a decoder section to achieve a transformation of the image information from iconic to abstract and then back to iconic while performing the task at hand. Therefore, CNNs can be taught to recognise specific collections of pixel/location pairs and subsequently find similar patterns in new image datasets. For CNNs to perform this function, they require a so-called 'training' stage. Training is a process whereby humans identify target subsets and show a CNN what to look for. In image analysis, this often relies on experts labelling image sections of interest so that the CNN can find similar regions in the future. For example, to train a CNN to recognise nodules on CT images of a lung, radiologists would draw around these nodules to show their correct extent. As the CNN is shown more nodules, it will become more capable of identifying similar regions. This process, leading to the software's ability to carry out the nodule localisation independently, is called supervised learning [14].
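The encoder/decoder pattern can be sketched as below in PyTorch; this toy example omits the skip connections that characterise the actual U-Net [14], and the channel counts and image size are assumptions chosen purely for illustration.

```python
# Toy encoder-decoder: the encoder halves the spatial size while increasing
# channels (iconic -> abstract); the decoder reverses this (abstract -> iconic).
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 1 -> 16 channels, size / 2
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16 -> 32 channels, size / 4
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),           # size x 2
    nn.ConvTranspose2d(16, 1, 2, stride=2), nn.Sigmoid(),         # back to full size, 1-channel mask
)

x = torch.randn(1, 1, 128, 128)   # one dummy radiograph
mask = decoder(encoder(x))        # per-pixel output, e.g., a segmentation map
print(mask.shape)                 # torch.Size([1, 1, 128, 128])
```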
Whilst in some cases CNNs can be definitive in their image recognition, they are more commonly used as assistive tools, whereby AI can highlight the areas that are likely to contain the sought pathology or image type. This has been demonstrated in radiology (for detecting abnormalities within chest X-rays) [15], dermatology (to detect lesions of oncological potential) [16], and ophthalmology (to detect specific types of retinopathy) [17]. It has been suggested, however, that within these fields, some of the work may lack the robustness to be truly generalisable to all clinical situations or, indeed, to be as accurate as medical professionals [18].
Within dentistry, radiographs are combined with a full clinical examination and special tests to aid in assessment, diagnosis, and treatment planning. The type of radiograph taken depends upon the disease or pathology being investigated or the procedure being undertaken, and may include bitewing, periapical, or orthopantomography [19]. Common justifications for taking dental radiographs include the diagnosis of dental caries, the staging and grading of periodontal disease [20], the detection of apical pathology [21], or the assessment of peri-implant health. A dentist will then report on these images. Still, it has been shown that in detecting both dental caries and periodontal bone loss, inter-rater and intra-rater agreement is poor [22,23], lending this analysis to CNN assistance.
Image-based diagnostics is not the only area in which CNNs are currently used within medicine and dentistry. AI's ability to assess large volumes of data for regression purposes lends itself to data analysis within larger data fields where traditional methods struggle. In medicine, success has been achieved using patient health records and metabolite data to predict Alzheimer's disease, depression, sepsis, and dementia [11]. As data processing technology has improved, the American Medical Association has introduced the term 'augmented intelligence' specific to medicine and dentistry. This describes a conceptualisation of AI in healthcare, highlighting its assistive role to medical professionals.
Whilst the benefits of utilising AI within healthcare are clear to see, for example reducing human error, assisting in diagnosis, and streamlining data analysis and task performance, which may ultimately lead to more efficient and cost-effective services, its adoption is not without issue. Specific issues include a lack of data curation, hardware, code sharing, and readability [24], as well as the inherent issues of introducing any new technology within a service, such as reluctance to change, embedding new technology within current infrastructures, and ongoing maintenance costs. More recent criticism has addressed the presence of implicit biases in training datasets and their direct consequences for AI performance [25].
Within periodontology and implantology, AI is still in its relative infancy and has not yet been used to its full potential. With the advantages of diagnostic assistance, data analysis, and detailed regression, it would appear that much could be gained through applying this tool. Given the relative paucity of literature in the subject area, this scoping review aimed to assess the current evidence on the use of artificial intelligence within the field of periodontics and implant dentistry. This would both describe current practice and guide further research in this field.

Materials and Methods
This prospective scoping review was conducted by considering the original guidance of Arksey et al. [26] and more recent guidance from Munn et al. [27].

Focused Question and Study Eligibility
The focused question used for the current literature search was "What are the current clinical applications of machine learning and/or artificial intelligence in the field of periodontology and implantology?" The secondary questions were as follows:
1. Which methods were used in these studies to establish datasets; develop, train, and test the model; and report on its performance?
2. In cases where these models were tested against human performance:
   a. What metrics were used to compare performance?
   b. What were the outcomes?

The inclusion criteria for the studies were:
1. Original articles published in English.
2. Implant- and periodontal-based literature using ML or AI models for diagnostic purposes, detection of abnormalities/pathologies, patient group analysis, or planning of surgical procedures.
3. Study designs whereby the use of ML or AI was the independent variable.

The exclusion criteria for the studies were:
1. Studies not in English.
2. Studies using classic software rather than CNN derivative protocols for machine-based learning.
3. Studies using AI for purposes other than periodontology and peri-implant health.

Study Search Strategy and Process
An electronic search was performed via the following databases:
• Medline - the most widely used medical database for publishing journal articles. The search strategy for this is outlined in Figure 1.
• Scopus - the largest database of scientific journals.
• CINAHL - an index that focuses on allied health literature.
• IEEE Xplore - a digital library that includes journal articles, technical standards, conference proceedings, and related materials on computer science.
• arXiv - an open-access repository of electronic preprints and postprints approved for posting after moderation but not peer review.
• Google Scholar - a freely accessible web search engine that indexes the full text or metadata of scholarly literature across an array of publishing formats.
This electronic search was supplemented with hand searches of the included texts' reference lists. The search strategy was compiled in collaboration with the librarian at the University of Sheffield Medical Library. Keywords were a combination of Medical Subject Headings (MeSH) terms and frank descriptors, which were employed to reflect the intricacies of each database. The publication period was set to 20 years and was restricted to literature in English. All articles were included for initial review in line with the scoping review methodology. Records were collated in reference manager software (EndnoteTM; Version 20, Clarivate Analytics, New York, NY, USA) [28], and the titles were screened for duplicates.

Study Selection
A single reviewer screened titles and abstracts. For records appearing to meet inclusion criteria, or where there was uncertainty, full texts were reviewed to determine their eligibility. This was completed twice by the reviewer at a 2-month interval to assess intra-rater agreement. Additional manual hand searching was performed of included full-text articles with reference lists from these studies included. These selected full texts were similarly read, and suitability for inclusion was as per the original criteria.

Data Extraction and Outcome of Interest
Data was extracted from the studies and recorded in a tabulated form. The standardised data collation sheet included the author title, year of publication, data format, application of ML/AI technique, the workflow of the ML/AI model, the subsequent training/testing datasets, the validation technique, the form of comparison used, and then some description of the performance of the AI model. The primary outcome of interest was the scope of current clinical applications of ML/AI in the field of periodontology and peri-implant health and the performance of these AI models in clinician or patient assistance. As this was a scoping review, all texts meeting eligibility criteria were subject to qualitative review.

Results

Study Selection and Data Compilation
The study selection process is outlined in Figure 2. A total of 401 articles were identified for screening after the removal of duplicates. In total, 293 texts were excluded after screening for factors not meeting inclusion criteria. Full-text review and hand searching identified 50 studies for inclusion in the qualitative analysis. Most of these excluded articles tested software rather than an AI architecture to assess the input data. The included articles were appraised, with key information compiled into a single data table (Table 1) for display in text format.

Location of Research
The first authors were from a wide variety of locations, illustrated in Figure 3. However, over half were from institutions in the USA (n = 12), China (n = 11), or South Korea (n = 7). The remainder were from Europe (n = 6), the Middle East (n = 4), the Indo-Pacific (n = 4), Brazil (n = 3), Turkey (n = 2), and Canada (n = 1). There is also strong evidence of international collaboration, with the first and last authors' locations being geographically disparate in a number of cases (n = 8).

Figure 4 depicts the year that studies were published. The first was in 2014, with a steady increase in the frequency of publication over the next decade. This number appears to be stabilising at 14-15 per year post-2020, in line with the six found in 2022 prior to April, when the searches were conducted.

Input Data
All included studies had a periodontal focus; however, the input data varied significantly (Figure 5). The majority (68%) of studies focused on imaging data, using either photographs (n = 12), radiographs (n = 20), or ultrasound images (n = 2). Patient data (Electronic Health Records) were used to attempt to predict periodontal or dental outcomes (n = 7). Metabolites and saliva markers were used to classify, diagnose, and predict periodontal and dental outcomes (n = 7).

Datasets
The dataset size spanned a large range due to the variability of inputs and outcomes of the study types. Datasets for image processing studies ranged from 30 to 12,179 images, with a mean of 1774 for solely panoramic radiographs, 1064 for solely periapical radiographs, and 1431 for photographs. Due to the novel nature of ultrasound imaging, the two included studies contained fewer data (n = 35 and 627). Patient datasets varied between 216 and 41,543, with no meaningful descriptive statistics possible due to outcome and methodological heterogeneity. Across all imaging studies, a mean of 1666 images was used.
However, it is worth noting that the number of images included in studies with relatively homogenous image types (i.e., plain film radiography) was not normally (Gaussian) distributed, with the median (585 images) diverging markedly from the mean (1940 images). Figure 6 shows the relative frequency peaks for visual analysis.

For image recognition, there appears to have been a shift to the use of U-Net over other architectures, with all nine studies utilising or comparing this platform published in 2021 or 2022 [29-37]. When compared against other architectures, Dense U-Net outperformed a standard U-Net [35], and U-Net was found to be optimised with a ResNet encoder [36].
A number of studies involving patient data (e.g., metabolites or Electronic Health Record (EHR) data) used support vector machine (SVM) algorithms [38-43]. These studies showed significant heterogeneity in methods, observations, and outcomes, and as such, no relevant statistical conclusions can be drawn. However, descriptively, SVMs showed either comparable outcomes [39] or reduced predictive capabilities [43] when compared to other ML formats such as the ANN multilayer perceptron (MLP), random forest (RF), or naïve Bayes (NB).
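The kind of tabular-data comparison described above can be sketched as follows; the features, labels, and model settings are entirely hypothetical and are not taken from the cited studies, which used their own datasets and tuning.

```python
# Hypothetical comparison of SVM, random forest, MLP, and naive Bayes on
# invented tabular 'patient' data, using cross-validated accuracy.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))        # invented metabolite/EHR-style features
y = rng.integers(0, 2, size=200)      # invented periodontitis labels

for name, clf in [("SVM", SVC()), ("RF", RandomForestClassifier()),
                  ("MLP", MLPClassifier(max_iter=500)), ("NB", GaussianNB())]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, round(scores.mean(), 3))
```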

Training and Annotation
The majority of the patient-data research in this body of literature focused on using machine learning models to assist with regression analyses. In these cases, training annotation is not required, as the models perform regression analyses rather than replicating human activity.
In the case of image data processing, CNNs are mimicking humans and, as such, require training. General dentists performed training annotation in 47% of papers (n = 16), some form of specialist dentist or trainee specialist dentist performed training annotation in 16% of papers (n = 6), and radiologists in 8% (n = 3) of studies. Dental hygienists were also used (n = 1), as well as mixtures of clinicians (n = 6).
The method of labelling was, in the majority of cases, manual annotation by drawing around or labelling the external pixels of the required features (n = 25). However, in a more recent paper [44], a process of 'dye staining' was used, whereby annotators merely highlighted areas of interest with single or multiple point annotations, and a CNN was used to ascertain the characteristics of the regions around these annotations.

Outcome Metrics and Comparative Texts
As is the nature of a scoping review, there is vast heterogeneity in data forms and methodologies, which results in very different outcome metrics [26,27]. These included an array of best-fit measurements, including F1 and F2 scores, precision, and accuracy, alongside sensitivity and specificity. Area under the curve analyses and intraclass correlation coefficients (ICC) between test sets and representative expert labels were also frequently quoted. More specific imaging outcomes were also used, with the Jaccard index, pixel accuracy, and Hausdorff distance utilised. This made a meaningful statistical comparison of outcomes difficult due to the vast number of analyses presented.
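For orientation, the sketch below computes several of these metrics on invented predictions using scikit-learn; it is illustrative only and does not reproduce any calculation from the included studies.

```python
# Commonly reported metrics computed on hypothetical predictions.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, jaccard_score)

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3])

print("accuracy   ", accuracy_score(y_true, y_pred))
print("precision  ", precision_score(y_true, y_pred))
print("sensitivity", recall_score(y_true, y_pred))       # recall = sensitivity
print("F1 score   ", f1_score(y_true, y_pred))
print("AUC        ", roc_auc_score(y_true, y_prob))
print("Jaccard    ", jaccard_score(y_true, y_pred))      # intersection-over-union, used for masks
```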
Descriptively, as one would expect, when ML was asked to produce nominal outcomes, accuracy increased. In cases where outcomes were dichotomous, 90-98% accuracy was reported [45]. Within research that is more image focused, this is reflected when the task is simpler, such as in the work of Kong et al., where gross recognition of periodontal bone loss reported an accuracy of 98% [46]. However, as the task asked of the ML increased in complexity, the accuracy was shown to drop. This is possibly best illustrated by the research of Lee et al. [36]. This study assessed several parameters to provide the best outcomes for the radiographic staging of periodontitis, using a U-Net with a ResNet-50 encoder for the majority of its image analysis. A power calculation was performed for gross bone loss. Here, Lee et al. describe the greatest accuracy (0.98) where there was no bone loss, with a reduction in accuracy (0.89) for finer increments such as minimal bone loss (stage 1).

Discussion
CNNs are becoming increasingly clinically relevant in their ability to assess imaging data and can be an excellent utility for analysing large, clinically relevant datasets. In the present review, we systematically compiled the applications of CNNs in the field of periodontology, evaluating the application and outcomes of these studies. The majority of these studies were completed in the USA and China, which is in line with the majority of DL papers in the medical spheres [79].
The use of CNNs in periodontal and dental research has continuously grown over the last decade. Since the first publications using CNNs in the early 2010s, there has been an exponential increase in the number of publications using this tool. In 2021 alone, there were 2911 registered studies on PubMed with CNN in the title, up from 42 in 2010. This, of course, makes empirical sense. The power, utility and applicability of this tool are endless and are improving as the architectures evolve, providing both generic and task-focused utility.
Previous systematic and scoping reviews in dentistry have highlighted the underuse of this tool and a lag for dental research in this area [80,81]. However, with the total number of periodontal imaging papers alone now equalling the number of dental imaging papers in 2018, this lag is likely to have been overcome. This is unequivocally to the betterment of dental patient care when considering the benefits patients have enjoyed through similar endeavours in medicine [82].
The advantage of a broad search is the volume of literature that is assessed. However, it must be noted that a significant portion of the literature was derived from technical standards, conference proceedings, and related materials on computer-science-oriented repositories rather than journal articles. This is advantageous for the authors because the time to publication can be significantly reduced by removing the requirement for peer or public review. This may suit the rapidly evolving world of computer science, where breakthroughs can occur at breakneck speed, but it is unclear if the intrinsic validity of these publications is reduced due to a lack of public/peer scrutiny. This literature is published by technical scientists and therefore reported differently to how clinicians might expect it.
The majority of studies focused on the processing of images and the recognition of structures. Radiography accounted for two-thirds of these data (Figure 5), reasonably evenly split between periapical and panoramic radiographs. With both imaging modalities indicated in the assessment of patients with periodontitis, virtual assistance in diagnosis will be relevant to the clinician. Of these studies, the majority focused on image segmentation rather than pathology detection. We can only assume that this was due to the relative complexity of a detection tool compared to segmentation. However, moving forwards, detection of relative change from consecutive radiographs or more pathology identification would be of use as an assistive tool.
Database collation is an issue for all data science. This is of significant difficulty when compiling data from medical records. When considering the field of machine learning, suitable numbers of records are required to train and refine an AI tool. Supervised training is recognised as a central tenet of improving a CNN's performance. This process requires that the CNN is shown labelled images that define the structures for the CNN to segment. It is, however, possible to over-train CNNs, resulting in errors due to over-recognition.
With homogenous data, it would not be unreasonable to believe that the distribution of training input utilised would be Gaussian or normally distributed. However, we find that the number of plain film radiographs utilised for training was multimodal in distribution. This is best illustrated by the divergent mean (1940) and median (585). Figure 6 shows a histogram of the data, where peaks can be seen between 100 and 1000 images and at over 4000 images used. The authors query whether this was related to the convenience sampling implicit with smaller numbers of data or whether it is reflective of the inherent variability in the requirements of the differing ML architectures used. A power calculation was performed in a single paper, but only for a single outcome, and as such may not have offered the research team an accurate assessment of the data volume required [37]. Further to this, there was no notable descriptive correlation between reported accuracy and the number of images used, with larger studies reporting a variety of accuracies or other outcomes [30,54,59,62,78]. This may be solely reflective of the noted variety of reporting outcomes and methodologies rather than a lack of correlation.
Labelling these images for training and subsequent reference tests is also of paramount importance. Gold standards were applied in several studies, whereby more than one clinician or a specialist radiologist performed this manual task to ensure a consensus approach to best fit was taken. However, in many studies this was performed by a single evaluator, reducing both the external validity of the results due to single-operator bias and the internal validity, introducing potential systematic error. Whilst some efforts have been made to standardise methodologies [83,84], these are yet to be adopted or referenced in wider practice. The majority of studies used pixel-by-pixel annotation tools, with a single study moving to a 'dye staining' or 'grab cut' methodology [73]. This practice changes the digital annotation sequencing, essentially highlighting areas of interest to the CNN rather than circumscribing the anatomical feature of relevance. This is already mainstream in several other fields [85], possibly highlighting the digital technical lag present in dental research and the opportunities available in this field.
The performance of CNNs was reported in very heterogeneous ways, almost all of which come with drawbacks. Area under the curve (AUC) analyses are important, but only partially informative when it comes to outcomes. An ROC curve lying above the chance diagonal is just a minimum requirement and, typically, only very elevated values are representative of applications suitable to handle real-world data; thus, sensitivity and specificity need to be reported as well to indicate performance. This was the case in some later papers, but those additional values were missing from most papers where AUC was a reported outcome. Accuracy was the most commonly reported outcome, and it is known to distort results when class imbalances are in place [86]. This is due to the class distribution being unknown in the training data, meaning that there is an assumption as to which population is more prevalent (i.e., bone loss or no bone loss). This assumption skews the data, making the reported ≈70-90% accuracy less meaningful due to inherent guesswork in ascertaining the class.
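A worked toy example of this distortion, with invented numbers, is given below: a classifier that always predicts the majority class scores a high accuracy while detecting no disease at all.

```python
# Invented example: 90 healthy sites, 10 diseased sites. A trivial classifier
# that always predicts 'healthy' achieves 90% accuracy but 0% sensitivity.
import numpy as np

y_true = np.array([1] * 10 + [0] * 90)   # 1 = bone loss, 0 = no bone loss
y_pred = np.zeros(100, dtype=int)        # always predict 'no bone loss'

accuracy = (y_pred == y_true).mean()                 # 0.90
sensitivity = (y_pred[y_true == 1] == 1).mean()      # 0.00
print(accuracy, sensitivity)
```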
It has been suggested that the gold standard employed in these papers should be for models to be tested against independent expert assessors on truly unseen data, or indeed for the models to be used in a clinical trial [83]. However, none of the included studies compared the outcomes from a truly independent examiner team on unseen data against the proposed AI/ML outcomes. This leads to a 'fuzzy gold standard', whereby the AI/ML outcome is being marked against the clinician examiners that were used to train the tool. Until literature providing baseline information on the efficacy of examiners is universally accepted, or performance against truly independent dentist performance in a clinical environment is shown, the assistive nature of these tools in a 'real world' setting will remain unproven.
With the marked reporting heterogeneity and uncertainty around gold standard testing, it is not surprising that meaningful comparison and pooling of data for meta-analysis have not been possible. This may be indicated in a future systematic review as more standardisation of this form of research occurs. However, descriptively, when considering outcomes, this body of research shows little improvement in accuracy over the last nine years, with accuracies of 90-98% reported in 2014 [45] and accuracies of 0.85 described in 2022. However, we feel that whilst the reported figures have remained similar, the question has markedly changed over this time. Increasingly complex problems are being asked of ML. Whilst historical papers asked more nominal questions, more recent literature such as Jiang et al. [37] looked to radiographically stage periodontitis from OPG radiographs. This represents a continuous problem that is grouped into ordinal datasets. Here, the accuracy established was similar to that of the periodontists who had completed the original manual labelling, in the region of 0.85, depending upon tooth position and severity of bone loss. This shows a significant improvement in the quality of the outcome and indicates that as the power of ML increases, the assistive nature of these tools may become more powerful.
This scoping review comes with its inherent limitations. The chosen question is broad, as is inherent within the purpose of a scoping review [27], and this is reflected by a broad search strategy. Whilst including articles from resources such as arXiv offers readers a larger pool of references, it should be noted that these articles are not peer-reviewed and therefore may lack some of the methodological rigour of published literature. In addition to this, the broad inclusion criteria have led to the authors somewhat controversially opting to include papers using a broad variety of machine learning and artificial intelligence modalities employed to analyse a broad range of data types. The resulting heterogeneity of the literature reduced the opportunity for meaningful outcome data comparison. However, the authors would add that they agree with the findings of the referenced homogenisation efforts to reduce the variability in results expressed in this field.
Methodologically, the main frailty revolves around searches and synthesis being performed by a single reviewer, with cursory checks by the second and third authors. Whilst the single reviewer completed the synthesis twice with an extended time interval between reviews, this still forms an area of inherent bias. However, this was necessary for the practicality of this study.

Conclusions
Overall, this review gives insight into the application of machine learning in the field of periodontology. Given artificial intelligence's relative infancy in healthcare, it is not surprising that significant heterogeneity was found in the methodologies and reporting outcomes. All efforts should be made to bring further research in line with increasingly recognised gold standards for research and reporting. International agreement on a gold standard against which to measure these tools would also significantly assist readers in assessing the utility of this modality of tool. As such, at this juncture, no accurate conclusions can be drawn as to the efficacy and usefulness of this tool in the field of periodontology.

Institutional Review Board Statement: Ethical review and approval were waived for this study due to the retrospective analytical nature of this study.

Informed Consent Statement: Not applicable.
Data Availability Statement: Further data required may be requested from the corresponding author.