The Reliability of Three-Dimensional Landmark-Based Craniomaxillofacial and Airway Cephalometric Analysis

Cephalometric analysis is a standard diagnostic tool in orthodontics and craniofacial surgery. Because conventional 2D cephalometry is limited and susceptible to analysis bias, clinical practice demands a more reliable and user-friendly three-dimensional system that covers hard tissue, soft tissue, and the airway. We launched this study to develop such a system based on CT data and landmarks. The study aims to determine whether the data labeled through our process are of high quality and whether the soft tissue and airway data derived from CT scans are reliable. We enrolled 15 patients (seven males, eight females, 26.47 ± 3.44 years old) diagnosed with either non-syndromic dento-maxillofacial deformities or obstructive sleep-disordered breathing (OSDB) to evaluate the intra- and inter-examiner reliability of our system. A total of 126 landmarks were adopted and divided into five sets by region: 28 cranial points, 25 mandibular points, 20 teeth points, 48 soft tissue points, and 6 airway points. All the landmarks were labeled by two experienced clinical practitioners, each of whom labeled all the data twice at least one month apart. Furthermore, 78 parameters in three sets were calculated: 42 skeletal parameters (23 angular and 19 linear), 27 soft tissue parameters (9 angular and 18 linear), and 9 upper airway parameters (2 linear, 4 areal, and 3 voluminal). The intraclass correlation coefficient (ICC) was used to evaluate the inter-examiner and intra-examiner reliability of landmark coordinate values and measurement parameters. The overwhelming majority of the landmarks showed excellent intra- and inter-examiner reliability. For skeletal parameters, angular parameters indicated better reliability, while linear parameters performed better for soft tissue parameters. The intra- and inter-examiner ICCs of airway parameters indicated excellent reliability.
In summary, the data labeled through our process are of high quality, and the soft tissue and airway data derived from CT scans are reliable. Landmarks that are not commonly used in clinical practice may require additional attention during labeling, as they are prone to poor reliability. Measurement parameters with values close to 0 tend to have low reliability. We believe this three-dimensional cephalometric system can reach clinical application.


Introduction
Diagnostics 2023, 13, 2360
Cephalometric analysis, first introduced by Hofrath H [1] and Broadbent BH [2], has been a standard diagnostic tool in orthodontics and craniofacial surgery for the last few decades [3][4][5][6]. In orthodontics and orthognathic surgery, accurate quantification of deformities and precise surgical planning require the digitization (detection and localization) of cranio-maxillofacial (CMF) landmarks. Initially, cephalometry focused on skeletal structures and could only assess two dimensions [7][8][9][10][11]. As two-dimensional cephalometric analysis evolved, soft-tissue cephalometric analyses were established to evaluate the attendant soft-tissue changes and esthetic considerations [12][13][14][15][16]. Moreover, as attention turned to the fact that patients with obstructive sleep-disordered breathing (OSDB) show certain craniofacial defects that may influence pharyngeal patency, cephalometric analyses focused on the airway were introduced [17][18][19]. Two-dimensional cephalometry is widely adopted in clinical practice because of its simplicity, convenience, and reasonable reliability. However, conventional cephalometry is susceptible to analysis bias because the superimposition of anatomic structures makes some landmarks difficult to determine with high accuracy and reliability [20][21][22].
To overcome this drawback, three-dimensional cephalometric analysis was introduced. The fundamental basis for digital three-dimensional cephalometry is data on the head and facial structure. Currently, many technologies (such as computed tomography (CT) [23,24], cone beam computed tomography (CBCT) [25], magnetic resonance imaging (MRI) [26,27], and facial scanning [28]) can provide high-resolution images without overlapping or distortion, which results in high-quality diagnostic images. In three-dimensional cephalometry, more landmarks, reference planes, and measurement parameters can be selected to enrich the analysis of bone, soft tissue, and airway anatomy [29][30][31][32][33][34][35][36]. Measuring volumes (especially airway volumes) and evaluating visual asymmetry become possible. Three-dimensional data therefore provide potentially useful information beyond what two-dimensional data offer. However, they also bring a large amount of redundant information, which places high demands on the processing of three-dimensional data. In clinical practice, landmark digitization (especially in three dimensions) is still performed manually, which is time-consuming, error-prone, and experience-dependent. Fast and reliable automated landmark digitization systems are highly desired by clinicians. Recently, motivated by the successes of machine learning in medical image analysis, many automated landmark digitization systems have been established with a certain level of accuracy [37][38][39][40][41].
To create a more reliable and user-friendly three-dimensional cephalometric system, we conducted a study to establish an automated multimodal measurement system that includes hard tissue, soft tissue, and airways. The first issue to address is high-quality training data, because the performance of machine learning models depends on their training data. A model can hardly surpass the quality of the data it is trained on, so robust models are difficult to train from unreliable data. The lack of high-quality data is one of the obstacles to improving the accuracy of machine learning, especially in the field of medical image analysis [42][43][44][45]. Although many studies have been conducted on the reliability of landmarks [46][47][48], most are based on landmarks commonly used in clinical practice. Our system aims to capture more clinical information and therefore introduces some less commonly used landmarks, such as those used to evaluate soft tissue nasal anatomy. Because most clinical experts have not marked these landmarks before, their reliability remains questionable. In addition, former studies showed important variations in experimental methods regarding the parameters of image acquisition, software, types of visualization, and the marked anatomic references. After careful consideration, CT data were chosen as the data source for our measurement system. Most patients with cranio-maxillofacial related disorders need to undergo a CT scan for diagnosis and surgical design. Soft tissue and airway structures can be obtained from CT data as well, thus avoiding inconvenience and extra radiation. To ensure the accuracy of our training data as far as possible, we must check the reliability of the landmarks and measurement parameters we select. At the same time, we also want to evaluate whether clinical experts can achieve a high level of reliability based on the definitions of some less commonly used landmarks.
We believe this is a necessary step before establishing a reliable automatic three-dimensional measurement system. It will make the thousands of labeled cases we build through our process more convincing. We also plan to open-source the data in the future and to promote the application of three-dimensional craniofacial measurement systems in clinical practice. We hope that our system can bring more efficient and detailed clinical data to clinicians and patients.

Materials and Methods
CT scans for this study were derived from a pre-existing clinical database of pre-orthognathic treatment records, and the study protocol was approved by the institutional review board of Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine (SH9H-2022-T45-2). No additional radiographic images were taken for study purposes. All the CT scans were taken in 2020 and anonymized. Data from 15 patients (seven males, eight females, 26.47 ± 3.44 years old) diagnosed with either non-syndromic dento-maxillofacial deformities or OSDB were included in this study. The CT DICOM files [49] were converted into point cloud data with a voxel size of 0.5 × 0.5 × 0.5 mm³, in which the scalar value of each point is its Hounsfield unit (HU) value [50]. The data were resampled using cubic spline interpolation, which previous studies have shown to be effective for resampling because of its good interpolating performance and processing efficiency [51][52][53][54]. SciPy package version 1.7.3 [55] based on Python version 3.7.3 [56] was used for data resampling.
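The resampling step can be illustrated with a minimal sketch. It assumes that the volume is already loaded as a NumPy array of HU values and that the original voxel spacing is known; the function name `resample_isotropic`, the example spacing, and the choice of `scipy.ndimage.zoom` are assumptions made for illustration, since the text above only states that SciPy and cubic spline interpolation were used.

```python
import numpy as np
from scipy import ndimage


def resample_isotropic(volume, spacing_mm, target_mm=0.5):
    """Resample a CT volume to isotropic voxels with cubic spline interpolation.

    volume     : 3D array of Hounsfield units, indexed (z, y, x)
    spacing_mm : original voxel spacing along each axis, in mm
    target_mm  : desired isotropic voxel edge length, in mm
    """
    # A zoom factor above 1 means more samples along that axis.
    factors = [s / target_mm for s in spacing_mm]
    # order=3 selects cubic spline interpolation; mode="nearest" repeats
    # edge values instead of padding with zeros at the volume boundary.
    return ndimage.zoom(volume.astype(np.float32), zoom=factors,
                        order=3, mode="nearest")


# Hypothetical input: 10 slices of 1.0 mm thickness with 0.5 mm in-plane pixels.
ct = np.full((10, 16, 16), -1000.0, dtype=np.float32)  # air everywhere
resampled = resample_isotropic(ct, spacing_mm=(1.0, 0.5, 0.5))
print(resampled.shape)
```

Only the slice axis is upsampled in this example, because the assumed in-plane spacing already equals the 0.5 mm target.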
In this study, 126 landmarks were adopted and divided into five sets by region: 28 cranial points (8 median and 10 bilateral), 25 mandibular points (7 median and 9 bilateral), 20 teeth points (10 bilateral), 48 soft tissue points (14 median and 17 bilateral), and 6 airway points (6 median). The alv(PNS) point was included in both the cranial and airway sets, so the five sets list 127 entries for 126 distinct landmarks (Table 1, Figures 1-5) [3,33,34,57,58,59]. All the landmarks were labeled by two experienced clinical practitioners using 3D Slicer software (version 4.13.0, https://www.slicer.org/, accessed on 23 March 2022) [60], each of whom labeled all the data twice at least one month apart. Three-dimensional reconstructions of the skeletal structure, teeth, soft tissue, and upper airway were created and exported as VTK [61] files using 3D Slicer before labeling (Figure 6). [...] (Table 2). Furthermore, 78 parameters in three sets were used in this study: 42 skeletal parameters (23 angular and 19 linear), 27 soft tissue parameters (9 angular and 18 linear), and 9 upper airway parameters (2 linear, 4 areal, and 3 voluminal) (Table 3).
Reliability consists of two aspects: inter-examiner reliability and intra-examiner reliability [63]. Inter-examiner reliability refers to the consistency between different examiners, while intra-examiner reliability means the ability of an examiner to record the same conditions in the same way over time. In this study, the intraclass correlation coefficient (ICC) was used to evaluate inter-examiner and intra-examiner reliability [6,64].
For inter-examiner reliability, the values from two sets of landmark coordinates and cephalometric analyses were used, and ICC estimates and their 95% confidence intervals were calculated based on a single-measurement, absolute-agreement, two-way random-effects model. For intra-examiner reliability, the average value of two sets of landmark coordinate values and cephalometric analyses from each examiner was used. ICC values less than 0.5 indicate poor reliability, values between 0.5 and 0.75 indicate moderate reliability, values between 0.75 and 0.9 indicate good reliability, and values greater than 0.90 indicate excellent reliability. All the ICC estimates were calculated using the Pingouin statistical package version 0.5.2 [65] based on Python version 3.7.3 [56].
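To make the ICC model concrete, the sketch below computes the single-measurement, absolute-agreement, two-way random-effects coefficient, often written ICC(2,1), directly from the two-way ANOVA mean squares. It is a from-scratch illustration in plain Python, not the Pingouin routine used in the study, and it does not compute confidence intervals. The function names, the example ratings, and the handling of values that fall exactly on a threshold are assumptions.

```python
def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    ratings: list of rows, one row per subject and one column per rater.
    """
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # number of raters (or labeling sessions)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares of the two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-subject mean square
    msc = ss_cols / (k - 1)              # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def classify_icc(icc):
    """Map an ICC value to the reliability categories listed above.

    Assumption: a value exactly on a threshold falls into the higher band,
    except 0.90, which stays "good" because "excellent" requires > 0.90.
    """
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


# Hypothetical coordinate values (mm) of one landmark axis for five patients,
# labeled by two examiners; these are not data from the study.
coords = [[12.1, 12.3], [15.4, 15.2], [9.8, 10.1], [20.0, 19.7], [17.5, 17.6]]
value = icc2_1(coords)
print(round(value, 3), classify_icc(value))
```

A call with identical columns returns 1.0, and the function raises a division error when all ratings are equal, since no variance remains to partition.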

Classification; Measurement Parameter; Definition
Soft Tissue; ; The average cross-sectional area of the airway enclosed by the planes parallel to FH and passing through alv(PNS) and c3
; Velopharynx, Mean| a; The average cross-sectional area of the airway enclosed by the planes parallel to FH and passing through alv(PNS) and u
; Glossopharynx, Mean| a; The average cross-sectional area of the airway enclosed by the planes parallel to FH and passing through u and c3
; Airway, Min| a; The minimum cross-sectional area (1 mm per step) of the airway enclosed by the planes parallel to FH and passing through alv(PNS) and c3
Voluminal; Airway| v; The volume of the airway enclosed by the planes parallel to FH and passing through alv(PNS) and c3
; Velopharynx| v; The volume of the airway enclosed by the planes parallel to FH and passing through alv(PNS) and u
; Glossopharynx| v; The volume of the airway enclosed by the planes parallel to FH and passing through u and c3
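The areal and voluminal definitions above can be sketched by counting voxels in a binary airway mask. This is only an illustration under strong assumptions: the mask is taken to be already cropped between the two bounding planes, its slices are taken to be parallel to FH, and the voxel size is the 0.5 mm isotropic size from the Materials and Methods. The function names and the toy mask are hypothetical, and the paper does not state how these parameters were actually computed.

```python
VOXEL_MM = 0.5  # isotropic voxel edge length stated in Materials and Methods


def slice_areas(mask, voxel_mm=VOXEL_MM):
    """Cross-sectional airway area (mm^2) of every slice of a binary mask.

    mask[z][y][x] is 1 inside the airway and 0 elsewhere; each z slice is
    assumed to be parallel to the FH plane.
    """
    pixel_area = voxel_mm ** 2
    return [sum(sum(row) for row in sl) * pixel_area for sl in mask]


def airway_parameters(mask, voxel_mm=VOXEL_MM, step_mm=1.0):
    """Volume, mean area, and minimum area sampled every `step_mm`."""
    areas = slice_areas(mask, voxel_mm)
    stride = max(1, round(step_mm / voxel_mm))  # slices per sampling step
    return {
        "volume_mm3": sum(areas) * voxel_mm,
        "mean_area_mm2": sum(areas) / len(areas),
        "min_area_mm2": min(areas[::stride]),
    }


def toy_slice(n_airway, size=4):
    """Hypothetical size x size slice with the first n_airway pixels set to 1."""
    return [[1 if r * size + c < n_airway else 0 for c in range(size)]
            for r in range(size)]


# Toy mask with 8, 4, 6, and 2 airway pixels in four consecutive slices.
toy_mask = [toy_slice(n) for n in (8, 4, 6, 2)]
params = airway_parameters(toy_mask)
print(params)
```

In this toy mask the 1 mm sampling step visits only every second 0.5 mm slice, so the reported minimum can miss a narrower slice lying between two sampled positions.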

The Intra- and Inter-Examiner Reliability of Landmark Coordinate Values
We first compared the intra- and inter-examiner ICCs for all landmark coordinate values across the 15 samples. Each landmark is a point in R^3 in the left-posterior-superior (LPS) coordinate system. The inter-examiner and intra-examiner ICCs for each landmark are listed in Table 4. The overwhelming majority of the landmarks showed excellent intra- and inter-examiner reliability, and intra-examiner reliability was better than inter-examiner reliability. Poorly performing landmarks fell under two conditions, described below in terms of reproducibility (inter-examiner) and repeatability (intra-examiner). (1) The ICC value is poor (less than 0.75) in the reproducibility in the S direction of pg'; the reproducibility and repeatability in the P direction of zy'_L; the reproducibility in the S direction of zy'_R; and the repeatability in the P direction. (2) The ICC value is good but the lower bound of the 95% confidence interval is less than 0.50, which might indicate potentially poor performance. This is only observed in the reproducibility in the L direction of or_L, ecm_u6_L, ecm_l6_L, ecm_l6_R, u1d_R, mc'_L, sbal'_R, and vi'_L; the P direction of zy_L, gn, me, ag_R, ecm_l6_L, pg', gn', al'_L, and sbal'_R; the S direction of ecm_u7_L, g', se', pn', sn', gn', sbal'_R, vs'_L, vs'_R, zy'_L, and zy'_R; the LS direction of ecm_u6_R, ecm_u7_R, enm_u7_L, and enm_u7_R; and the LPS direction of sbal'_R. The majority of the landmarks with poor performance are not commonly used ones.
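The two conditions used to flag poorly performing landmarks can be written as a small rule. The sketch below uses the cutoffs quoted above (ICC below 0.75, or a lower 95% confidence bound below 0.50); the function name, the labels it returns, and the example values are hypothetical and are not taken from Table 4.

```python
def flag_reliability(icc, ci_lower, icc_cutoff=0.75, ci_cutoff=0.50):
    """Classify one ICC estimate by the two conditions described above."""
    if icc < icc_cutoff:
        return "condition 1"   # the point estimate itself is below the cutoff
    if ci_lower < ci_cutoff:
        return "condition 2"   # estimate acceptable, interval reaches below 0.50
    return "acceptable"


# Hypothetical (ICC, lower 95% CI bound) pairs per landmark axis.
examples = {
    ("landmark_a", "S"): (0.62, 0.31),
    ("landmark_b", "L"): (0.88, 0.44),
    ("landmark_c", "P"): (0.97, 0.93),
}
flags = {key: flag_reliability(*pair) for key, pair in examples.items()}
for key, flag in flags.items():
    print(key, flag)
```

Condition 1 is checked first, so an estimate that fails both cutoffs is reported only once.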

The Intra- and Inter-Examiner Reliability of Measurement Parameters
The intra- and inter-examiner ICCs for the measurement parameters across the 15 samples were then calculated. The inter-examiner and intra-examiner ICCs for each parameter are listed in Table 5. Most of the parameters showed excellent intra- and inter-examiner reliability. For skeletal parameters, angular parameters indicated better reliability, while linear parameters performed better for soft tissue parameters. The intra- and inter-examiner ICCs of airway parameters indicated excellent reliability. For poorly performing parameters, the same two conditions were observed: (1) The ICC value is poor (less than 0.75) in the reproducibility and repeatability of Maxillary Yawing| • and the repeatability of Go Canting| •, mp'-sm'| d, pn'-sn'| d, Inner Canthic Diameter| d, and Upper Vermilion Width| d. (2) The ICC value is good, but the lower bound of the 95% confidence interval is less than 0.50 in the reproducibility of Upper Vermilion Width| d and the repeatability of Condyle Yaw R| •. The repeatability of examiner 2 seemed to be relatively poor.

Discussion
In this study, we evaluated the inter- and intra-examiner reliability of our three-dimensional landmark-based cranio-maxillofacial and airway cephalometric analysis at both the landmark and measurement parameter levels. We aimed to determine whether the data labeled through our process are of high quality and whether the soft tissue and airway data derived from CT scans are reliable.
Landmarks in our study were points in R^3, from which all the measurement parameters were calculated. The reliability of the landmarks therefore matters. Bookstein introduced three types of landmarks (I, II, and III) to elucidate their character and degree of reliability [66]. To reach a more robust measurement system, points derived from anatomical structures (Bookstein class I) were preferred [66]. Our study showed that the reliability of most of the landmarks was excellent, though we still found some with relatively poor performance. Landmarks that are not commonly used in clinical practice may require additional attention, as they may have poor reliability. We will take extra care when labeling these landmarks.
For the skeletal cranial landmarks, the poorest performance was found in the anterior-posterior direction of the Zygion (zy), which is defined as the most lateral point of the zygomatic arch. This point is not based on a specific anatomical structure and requires the examiner to estimate the position by visual observation. Because of the arched shape of the zygomatic arch, small fluctuations in the left-right direction can cause significant changes in the anterior-posterior direction. Currently, the Zygion (zy) is mainly used to measure the width of the face, so as long as the left-right fluctuation is small, it has little impact on the measurement output. However, for safety reasons, we also selected some alternative landmarks: the Mastoidale (ms), the Zygomaxillare (zm), and the Jugale (ju), which had much better repeatability and reproducibility.
While the landmarks belonging to the mandibular, teeth, and airway structures showed great repeatability and reproducibility in our study, some soft tissue points were suboptimal. As with the skeletal zygion (zy), both the reproducibility and repeatability of the anterior-posterior direction of the soft tissue zygion (zy') were poor. In contrast to the skeletal pogonion (pg), the soft tissue pogonion (pg') showed a low inter-observer ICC in the inferior-superior direction, which indicated a potential labeling deviation between the two raters. After reviewing our data, we speculated that the possible reason was the discrepancy between the curvature of soft tissue and hard tissue in facial contour analysis. The soft tissue was more flexible and had lower radii of curvature than hard tissue, which made it harder to locate the pg'.
To improve the reliability of the landmarks with relatively poor performance, we initiated a project to optimize the Bookstein type II landmark labeling with computer assistance. The program would relocate the labeled landmark based on numerical calculation. Like other researchers [37][38][39][40][67][68], we also attempted to establish an automatic labeling system based on a machine-learning technique.
In our measurement system, the output parameters were calculated from landmarks and rules. The reliability of measurement indicators might not be completely equivalent to the reliability of landmarks. Thus, we evaluated the repeatability and reproducibility of parameters as well. In our study, linear parameters of hard tissue seemed to be more robust than the angular ones, while the opposite is true for parameters of soft tissue. Maxillary Yawing| • demonstrated poor reproducibility and repeatability, while Go Canting| • showed low repeatability. Maxillary Yawing| • is designed to evaluate the yawing of the maxilla: the closer its value is to 0, the less skewed the maxilla is. Most patients have little maxillary yawing, so the value of Maxillary Yawing| • is close to 0, and we believe this is the main reason for its low reliability. Go Canting| • is derived from the go (Gonion) points. The consistency of examiner 2's Go Canting| • in our study was relatively poor, while the consistency of the go points themselves was still at a relatively high level. This indicated that deviations may be amplified in the calculation from points to measurement values. We found similar phenomena in Upper Vermilion Width| d and Inner Canthic Diameter| d. As a result, in our subsequent studies, we will explore and quantify how errors change between points and measurement values.
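The amplification of deviations from point to measurement value can be illustrated with a small numerical sketch. It uses a simplified, hypothetical canting angle (the angle between the line joining a bilateral landmark pair and the axial plane of the image) rather than the definition of Go Canting used in our system, and the coordinates are invented for illustration.

```python
import math


def canting_deg(left, right):
    """Angle (degrees) between the left-right landmark line and the axial plane.

    Points are (L, P, S) coordinates in mm; S is the inferior-superior axis.
    This simplified definition is an assumption made for illustration.
    """
    dl = right[0] - left[0]
    dp = right[1] - left[1]
    ds = right[2] - left[2]
    horizontal = math.hypot(dl, dp)  # length of the projection on the axial plane
    return math.degrees(math.atan2(ds, horizontal))


def width_mm(left, right):
    """Euclidean distance between the two landmarks, in mm."""
    return math.sqrt(sum((r - l) ** 2 for l, r in zip(left, right)))


# Hypothetical bilateral landmarks 90 mm apart and perfectly level.
go_l = (45.0, 0.0, 0.0)
go_r = (-45.0, 0.0, 0.0)
# The same right landmark labeled 1 mm more superiorly in a second session.
go_r_shifted = (-45.0, 0.0, 1.0)

angle_change = canting_deg(go_l, go_r_shifted) - canting_deg(go_l, go_r)
width_change = width_mm(go_l, go_r_shifted) - width_mm(go_l, go_r)
print(f"angle change: {angle_change:.3f} deg, width change: {width_change:.4f} mm")
```

In this example a 1 mm labeling difference changes the width by less than 0.01 mm but changes the angle by roughly 0.64 degrees. Because the ICC compares between-patient variance with error variance, an error of this size weighs heavily when most patients have angles close to 0, which is consistent with the low reliability observed for such parameters.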
In our study, the parameters of the airway were stable. For airway indicators, a three-dimensional measurement may describe the airway in a better way. Since OSDB is caused by upper airway collapse, the aim of clinical treatment for OSDB is to find and relieve the narrowest region of the airway [69][70][71][72][73][74][75][76][77][78][79][80]. Fortunately, the important indicator of the narrowest area of the airway (Airway, Min| a in our study) proved to be robust.
Unfortunately, only nine parameters of the airway were adopted in this study. One reason is that research studies on airway morphology are still in their very early stages, and there are relatively few parameters that are clinically applicable. On the other hand, due to technical limitations, some indicators are too complex to be calculated efficiently and stably. For example, in airway assessment, nasal cavity volume is actually a very important parameter since nasal stenosis can also lead to OSDB. Currently, the segmentation of the nasal cavity is still based on air/soft tissue thresholds. Since the nasal cavity is not only connected to the pharynx but also to the nasal sinuses, segmentation based on thresholds will usually segment out the nasal sinus cavity (Figure 7). To exclude the nasal sinus cavity, manual erasure is required, which is labor-intensive and may lead to decreased accuracy due to unclear boundaries between the nasal sinuses and the nasal cavity. In addition, as the size of the nasal sinuses varies among patients, including them in nasal cavity volume measurement could not indicate nasal cavity morphology correctly. As the measurement of the nasal cavity should be based on the efficient and accurate segmentation of the relevant structure of the nasal cavity, which is currently beyond the scope of this study, this part of the study did not include airway measurements related to the nasal cavity. We plan to further explore this in subsequent studies.
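The difficulty with threshold-based segmentation can be shown on a toy example. The sketch below grows a region of "air" pixels from a seed in a small 2D grid of HU values; because the sinus is connected to the nasal cavity through a narrow opening, it is segmented together with the cavity until that opening is closed by hand. The threshold of -400 HU, the grid, and the function name are assumptions made for illustration; the threshold actually used is not stated here.

```python
from collections import deque

AIR_THRESHOLD_HU = -400  # assumed air/soft tissue cutoff, for illustration only


def segment_air(hu, seed, threshold=AIR_THRESHOLD_HU):
    """Return the set of (row, col) pixels below `threshold` that are
    4-connected to `seed` (breadth-first region growing)."""
    rows, cols = len(hu), len(hu[0])
    if hu[seed[0]][seed[1]] >= threshold:
        return set()
    region, queue = {seed}, deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in region and hu[nr][nc] < threshold:
                region.add((nr, nc))
                queue.append((nr, nc))
    return region


A, T = -1000, 40  # air and soft tissue HU values of the toy image
toy = [
    [T, T, T, T, T, T, T],
    [T, A, A, T, A, A, T],  # columns 1-2: nasal cavity, columns 4-5: sinus
    [T, A, A, A, A, A, T],  # column 3: narrow opening joining the two
    [T, A, A, T, A, A, T],
    [T, T, T, T, T, T, T],
]
seed = (2, 1)  # a pixel inside the nasal cavity
with_sinus = segment_air(toy, seed)

# "Manual erasure": close the opening by overwriting it with soft tissue.
erased = [row[:] for row in toy]
erased[2][3] = T
cavity_only = segment_air(erased, seed)
print(len(with_sinus), len(cavity_only))
```

The segmented region shrinks from 13 pixels to the 6 pixels of the cavity once the opening is erased, which mirrors why the boundary has to be drawn manually when it is not visible in the HU values.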
Figure 7. [...] Basalis, (d) Norma Verticalis. In the images above, (1) represents the maxillary sinus air cavity structure, (2) represents the frontal sinus air cavity structure, (3) represents the ethmoid sinus air cavity structure, and (4) represents the nasal cavity airway structure.
This work is also the precursor of our automatic 3D cephalometric system project. Because 3D measurement adds a large amount of information compared to 2D measurement, the high cost of manual processing has become a major obstacle to the clinical application of 3D measurement systems. We believe that an automatic processing system is an effective solution, and we plan to use machine-learning techniques to achieve automatic labeling of landmarks. Excellent auto-labeling models require correctly labeled data. Our work showed that the reliability of the system described in this paper was excellent; thus, we believe the training data can be of high quality. For 3D cephalometric systems, whether a landmark-based cephalometric technique is still applicable is a question worth pondering.
Compared to 2D analysis, 3D analysis can offer parameters of symmetry, volume, and so on, and for these indicators, landmarks may be the key to quickly locating the region of interest. As a result, our 3D system is designed to keep updating its landmark list automatically to adapt to new demands.

Conclusions
In summary, we introduced a three-dimensional measurement system that covers hard tissue, soft tissue, and the airway. The repeatability and reproducibility of the measurement system were evaluated in two aspects, landmark coordinates and measurement parameters, and were found to be robust enough for clinical practice. The data labeled through our process are of high quality, and the soft tissue and airway data derived from CT scans are reliable. Landmarks that are not commonly used in clinical practice may require additional attention during labeling, as they are prone to poor reliability. Measurement parameters with values close to 0 tend to have low reliability. Landmarks may be key to quickly locating regions of interest in successor three-dimensional cephalometric systems. We believe this three-dimensional cephalometric system can reach clinical application and help clinical practitioners improve the quality of clinical practice.
Author Contributions: Conceptualization, K.Y. and G.S.; methodology, K.Y. and W.Y.; software, K.Y. and L.X.; validation, K.Y., Y.X. and S.W.; formal analysis, K.Y. and Y.X.; investigation, K.Y.; resources, K.Y.; data curation, K.Y.; writing-original draft preparation, K.Y. and Y.X.; writing-review and editing, L.X., S.W., W.Y. and G.S.; visualization, K.Y. and Y.X.; supervision, K.Y.; project administration, K.Y.; funding acquisition, K.Y. All authors have read and agreed to the published version of the manuscript.

Informed Consent Statement: Patient consent was waived as CT scans for this study were derived from the pre-existing clinical database of cranio-maxillofacial related disorders treatment records, no additional radiologic images were taken for the current study, and data were desensitized in the current study.

Data Availability Statement:
The data presented in this study are available on reasonable request from the corresponding author. The data are not publicly available due to privacy issues and regulation policies in hospitals. All requests about the semi-automatic system, labeling, comparison and evaluation can be sent to the first author (K.Y.).

Conflicts of Interest:
The authors declare no conflict of interest.