1. Introduction
As the global usage of eyewear continues to grow, the demand for comfort and customization is likewise on the rise [
1]. During the process of selecting lenses and frames, in addition to measuring refractive errors (e.g., myopia and astigmatism) and dispensing, two geometric parameters are critical: interpupillary distance (PD) and pupillary height (PH) [
2]. PD is the horizontal distance between the centers of the left and right pupils, and it is used to align the optical centers of the lenses with the visual axes. PH is defined for each eye as the vertical distance from the pupil center (as seen through the lens) to the lowest point of the corresponding spectacle lens rim; this value determines how the optical center or segment height is positioned within the frame. Accurate PD and PH are required to avoid unwanted prism and to ensure comfortable binocular vision. Misalignment can lead to visual discomfort, blur, dizziness, or asthenopia and may reduce tolerance to prolonged wear [
3,
4]. In current optical dispensing practice, pupillary height (also called fitting height or segment height) is obtained by capturing a calibrated frontal view of the wearer, locating the pupil center for each eye, and measuring its vertical offset relative to the lower rim of the corresponding spectacle lens or a marked fitting cross. Commercial systems such as the Zeiss i.Terminal2 perform this automatically using calibrated imaging, but the required hardware is expensive and typically used in controlled lighting.
Currently, PD is often measured in optical shops or hospitals using pupillometers or PD rulers. However, the accuracy of these methods depends heavily on the operator’s skill, making it prone to errors that may result in prism effects [
5]. Although advanced digital positioning devices, such as Zeiss’s i.Terminal2, utilize 3D facial recognition to improve measurement accuracy, their high cost has limited their widespread adoption [
6]. To address this challenge, this paper proposes an intelligent pupil height and distance measurement system based on deep learning. From a single captured frontal facial image, the system uses an ensemble of regression trees (ERT) to localize facial landmarks and seed the pupil centers, the improved instance segmentation algorithm RC-BlendMask to segment the spectacle lens rims, and the recovered three-dimensional coordinates to output PD and PH automatically, without manual ruler alignment. This reduces dependence on operator skill and minimizes manual measurement steps. Because the full pipeline runs in under one second per frame on a consumer-grade camera/computer setup and does not rely on specialized multi-camera hardware such as the i.Terminal2, it is suitable for low-cost deployment in routine dispensing. Quantitatively, the mean absolute differences versus i.Terminal2 were 1.13 mm (PD), 0.73 mm (PH-L), and 0.89 mm (PH-R), which fall within commonly cited clinical tolerances for centration and prism (ANSI Z80.1/ISO 21987). These results indicate that clinically acceptable PH/PD measurements can be obtained with commodity hardware and minimal manual intervention.
The evolution of PD and PH measurement methods has undergone several stages. Early approaches relied on straightforward manual tools like PD rulers, which, although simple to operate, were vulnerable to issues such as parallax error—especially when dealing with eye position abnormalities (e.g., strabismus) [
7]. To improve accuracy, Xu Guangdi introduced a line-of-sight distance measurement method, which measures along the patient’s actual visual axis rather than relying on external facial landmarks, thereby reducing parallax-related error in cases such as strabismus [
8]. The maturation of optical and electronic measurement technologies led to more sophisticated instruments such as handheld pupillometers (e.g., HRK-7000A, Topcon PD-8, Nidek PD Meter II) [
9]. Zeiss’s i.Terminal2 system further advanced measurement precision by combining high-resolution cameras and advanced image processing, though it remains cost-prohibitive for broader adoption in many optical shops.
With rapid developments in the internet and computer vision, researchers have explored measuring PD and PH via software applications that utilize image processing and machine learning. For instance, Zheng et al. [
10] designed a mobile application that employs an OpenCV-built classifier to locate facial features and calculate PD by comparing pupil positions against a reference object. Other studies have similarly leveraged machine learning to improve pupil detection accuracy, indicating a growing trend toward data-driven methods [
11]. Traditional image processing techniques—such as those from Kumar et al. [
12] and Gu et al. [
13]—often rely on color features, thresholding, and contour tracking to locate the pupil. Although these methods can achieve high-speed detection in certain conditions, they are susceptible to noise, variations in lighting, and occlusion by eyelids, which may reduce accuracy or robustness. More recent works employing convolutional neural networks (CNNs), such as those by Lin et al. [
14], Li et al. [
15], and Sun et al. [
16], demonstrate improved precision and robustness in non-ideal settings. More recently, several studies have begun to fuse low-cost imaging, deep neural inference, and explicit 3D head/eye geometry to achieve fully contactless ocular and facial biometry. Barry and Wang showed that pupil size can be robustly quantified from standard RGB smartphone cameras across different skin tones and iris contrasts by learning a far-red–guided pupil segmentation and calibration pipeline, indicating that accurate pupillometry does not strictly require dedicated ophthalmic hardware [
17]. Shen et al. combined per-eye keypoint detection with a time-of-flight depth camera and a geometric head–eye model to recover 3D gaze direction in real time, illustrating how learning-based feature localization and 3D reconstruction can be integrated for biometric gaze estimation [
18]. Ben Barak-Dror et al. used short-wave infrared imaging together with learned pupil/eyelid modeling to perform rapid, contactless pupillometry and gaze estimation even with closed eyelids, highlighting clinical potential in non-cooperative or critical-care scenarios [
19]. Qammaz and Argyros presented an occlusion-tolerant pipeline that regresses 3D head pose and gaze direction from a single RGB view using a lightweight deep model, targeting real-time performance without multi-camera rigs [
20,
21,
22]. These recent works reinforce the trend toward low-cost, vision-based, calibration-aware ocular measurements, and motivate our goal: a single-camera system that estimates pupillary distance (PD) and pupillary height (PH) with accuracy comparable to specialized commercial devices while satisfying clinical tolerances. Recent work in facial landmark localization has moved beyond classical cascaded regressors, using dense 3D face modeling for large-pose alignment (e.g., 3DDFA) [
23], multi-stage CNN refinement with global facial context (e.g., Deep Alignment Network) [
24], and loss functions such as Wing and Adaptive Wing loss that emphasize small localization errors to improve robustness under occlusion and pose variation [
25]. High-capacity boundary-aware and stacked-hourglass architectures have pushed performance on both 2D and 3D alignment benchmarks, especially under challenging illumination and expression conditions [
26]. These advances form the technical background of our work: we target the same need for reliable ocular landmarks under real-world head pose and eyelid occlusion, but with a lightweight two-layer regression-tree cascade rather than a heavy multi-stage heatmap network so that the system can run on a single consumer camera in a retail dispensing setting [
27].
Building on prior work in computer vision–based optical measurement, this study makes the following contributions:
- (1)
Automated PD/PH measurement from a single RGB image.
We develop an end-to-end system that captures a single frontal image of the wearer, localizes facial landmarks, estimates the pupil centers, segments the spectacle lenses, and computes interpupillary distance (PD) and pupillary height (PH) in physical units. The workflow is shown in
Figure 1. The system is designed to run on commodity hardware without multi-camera rigs.
- (2)
Two-layer ERT landmarking with pupil-center refinement.
We train a two-layer Ensemble of Regression Trees (ERT) for facial keypoint localization and coarse pupil-center seeding. We then refine the pupil centers using direction-aware ray casting, edge-side–stratified RANSAC, and final least-squares circle fitting (a minimal sketch of this refinement appears after this list). This reduces sensitivity to eyelid occlusion and improves localization stability under non-ideal gaze/illumination.
- (3)
Spectacle lens segmentation via RC-BlendMask.
We introduce RC-BlendMask, an enhanced instance-segmentation model that fuses BlendMask with RCF-style edge features to suppress boundary diffusion and recover clean lens rims. Precise spectacle lens rim segmentation is required to identify the lowest point on each lens, which is used in the definition of pupillary height (PH is the vertical distance from the pupil center to the lowest point of the corresponding spectacle lens rim). Without reliable rim extraction, PH cannot be computed consistently.
- (4)
Head-pose gating and 3D pixel-to-mm calibration.
We estimate head pose with a PnP-based solver from 2D facial landmarks, recover Euler angles (yaw/pitch/roll), and reject frames whose pose exceeds predefined thresholds. We also use the recovered pose and camera intrinsics to perform pixel–millimeter conversion on the lens plane, enabling 3D-aware PD/PH estimation from a monocular camera.
- (5)
Quantitative robustness and agreement analysis.
We evaluate the system on 30 participants against a commercial device (Zeiss i.Terminal2). Robustness is quantified using multiple statistical measures: mean absolute error (MAE) and root-mean-squared error (RMSE), Pearson correlation, Bland–Altman bias and 95% limits of agreement (LOA), and repeatability metrics (within-subject SD, repeatability coefficient). The mean absolute differences versus i.Terminal2 were 1.13 mm for PD, 0.73 mm for left PH, and 0.89 mm for right PH, all within commonly cited ANSI Z80.1/ISO 21987 tolerances on decentration and unwanted prism. We also report segmentation quality (Precision, Recall, F1, IoU) for RC-BlendMask and head-pose MAE (yaw/pitch/roll) on public pose datasets.
- (6)
Practical deployment perspective.
Because our pipeline uses a single off-the-shelf camera and standard compute, rather than proprietary multi-camera hardware, it has the potential to lower measurement cost while maintaining clinically acceptable agreement with an established commercial reference.
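To make contribution (2) concrete, the sketch below outlines the last two refinement stages: RANSAC-style hypothesis sampling over candidate iris-edge points followed by an algebraic least-squares circle fit. The direction-aware ray casting that produces the edge points and the edge-side stratification of the sampling are omitted for brevity, and all names are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def fit_circle_lsq(points):
    """Algebraic (Kasa) least-squares circle fit; returns (cx, cy, r)."""
    x, y = points[:, 0], points[:, 1]
    A = np.column_stack([2 * x, 2 * y, np.ones(len(points))])
    b = x ** 2 + y ** 2
    cx, cy, c = np.linalg.lstsq(A, b, rcond=None)[0]
    return cx, cy, np.sqrt(c + cx ** 2 + cy ** 2)

def refine_pupil_center(edge_pts, n_iter=200, inlier_tol=1.5, rng=None):
    """RANSAC over iris-edge points, then a final least-squares refit on the inliers.

    edge_pts: (N, 2) candidate iris-edge points produced by the ray-casting stage.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    best_inliers = None
    for _ in range(n_iter):
        sample = edge_pts[rng.choice(len(edge_pts), 3, replace=False)]
        cx, cy, r = fit_circle_lsq(sample)
        dist = np.abs(np.hypot(edge_pts[:, 0] - cx, edge_pts[:, 1] - cy) - r)
        inliers = edge_pts[dist < inlier_tol]
        if best_inliers is None or len(inliers) > len(best_inliers):
            best_inliers = inliers
    cx, cy, r = fit_circle_lsq(best_inliers)   # final sub-pixel estimate
    return (cx, cy), r
```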
Recent work in optical metrology and computational optics has shown that deep learning can augment physically motivated imaging pipelines, for example by suppressing stray light in wide-field astronomical imaging or learning to emulate ghost reflections in optical systems [
17,
18]. Our results extend this vision–geometry fusion trend to ophthalmic dispensing, targeting contactless PH/PD measurement suitable for routine retail and tele-optometry use.
This paper is structured as follows:
Section 2 details the materials and methods employed, including the ensemble regression trees for facial keypoint localization, the RC-BlendMask algorithm for lens segmentation, and the PnP-based head pose estimation technique.
Section 3 presents experimental results and discussion, showcasing comparative analyses against the Zeiss i.Terminal2 device, evaluating error sources, and demonstrating the system’s robustness. Finally,
Section 4 concludes by summarizing the primary findings, highlighting the system’s practical relevance, and discussing potential avenues for further refinement of both the algorithms and the overall measurement strategy.
3. Results
Building on the methods described above, we conducted a comparative measurement experiment with the intelligent measurement system, which incorporates the algorithms designed for the precise measurement of pupillary height and distance. The i.Terminal2 device (Carl Zeiss AG, Guangzhou, China), recognized for its high-precision measurement capabilities, was employed as the benchmark for assessing the performance of our system. The results of the experiment demonstrate the system's accuracy and practicality.
To evaluate measurement accuracy, we collected data from 30 adult volunteers. Participants were recruited by convenience sampling at a single site; no pediatric subjects were included. We did not attempt to enforce demographic balance, so this group should not be interpreted as statistically representative of the general population. Each subject was imaged while wearing their own spectacles (with different frame geometries and coatings). Images were acquired with a fixed consumer camera mounted approximately 30 cm from eye level, under a consistent indoor lighting setup chosen to reduce glare and strong reflections on the spectacle lenses. No head restraint was used. The system automatically estimated head pose (yaw, pitch, roll) from facial landmarks using a PnP-based solver. Frames in which any Euler angle exceeded predefined thresholds (approximately ±15° yaw/roll and ±10° pitch) were rejected and a new frontal frame was captured. In other words, instead of forcing strict fixation, we allowed natural behavior and simply discarded off-angle frames. This pose-gating procedure is the same mechanism used later in the measurement pipeline.
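The pose-gating step can be reproduced with OpenCV's PnP solver. The following sketch assumes a generic 3D facial model with points corresponding to the detected landmarks; the helper name and the exact axis convention are illustrative, while the angular thresholds are those stated above.

```python
import cv2
import numpy as np

YAW_ROLL_MAX_DEG, PITCH_MAX_DEG = 15.0, 10.0   # gating thresholds described above

def pose_gate(landmarks_2d, model_points_3d, camera_matrix, dist_coeffs=None):
    """Return (accepted, yaw, pitch, roll) for one captured frame.

    landmarks_2d    : (N, 2) detected facial landmarks in pixels
    model_points_3d : (N, 3) corresponding points of a generic 3D face model
    camera_matrix   : 3x3 intrinsic matrix of the capture camera
    """
    if dist_coeffs is None:
        dist_coeffs = np.zeros((4, 1))
    ok, rvec, _ = cv2.solvePnP(model_points_3d.astype(np.float64),
                               landmarks_2d.astype(np.float64),
                               camera_matrix, dist_coeffs,
                               flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        return False, None, None, None
    rot_mat, _ = cv2.Rodrigues(rvec)
    # Euler angles in degrees; mapping to (pitch, yaw, roll) depends on the model axes.
    pitch, yaw, roll = cv2.RQDecomp3x3(rot_mat)[0]
    accepted = (abs(yaw) <= YAW_ROLL_MAX_DEG and
                abs(roll) <= YAW_ROLL_MAX_DEG and
                abs(pitch) <= PITCH_MAX_DEG)
    return accepted, yaw, pitch, roll
```

Frames for which `accepted` is False are simply discarded and re-captured, as described above.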
In addition to our intelligent measurement system, the Zeiss i.Terminal2 was used to measure pupillary distance and height, and its measurements served as the benchmark for assessing the accuracy of our system. This comparative analysis allowed us to evaluate the discrepancies between our intelligent measurement system and an established high-precision industry device, thereby supporting the reliability and efficacy of the methods described in this paper for practical use. Such a comparative analysis is also essential for the ongoing refinement of the intelligent measurement system. Some measurement data are shown in
Table 6.
To graphically illustrate the discrepancies between the outcomes of the two measurement methodologies, the data are plotted for all subjects as shown in
Figure 19. The findings from the proposed intelligent measurement system closely align with those of the i.Terminal2, underscoring the high precision of the system’s measurements. We then performed a quantitative agreement analysis between the two methods. For each metric—interpupillary distance (PD), left pupillary height (PH-L), and right pupillary height (PH-R)—we computed the mean difference between the two systems, the corresponding standard deviation, the 95% Bland–Altman limits of agreement, and the maximum absolute difference across the 30 participants. These results are summarized in
Table 7. Overall, the average measurements from both systems are largely congruent, with the proposed system giving slightly higher values on average, and the observed dispersion (standard deviation) and 95% limits of agreement remaining within clinically acceptable tolerances.
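The per-metric agreement summary in Table 7 can be computed with a few lines of standard statistics. The sketch below (with illustrative names, assuming paired per-subject measurements from the two systems) shows the quantities reported: mean difference, standard deviation of the differences, 95% Bland–Altman limits of agreement, and maximum absolute difference.

```python
import numpy as np

def agreement_summary(ours_mm, reference_mm):
    """Bland-Altman style agreement summary for one metric (PD, PH-L, or PH-R)."""
    d = np.asarray(ours_mm, float) - np.asarray(reference_mm, float)  # per-subject differences
    bias = d.mean()                              # mean difference between the two systems
    sd = d.std(ddof=1)                           # standard deviation of the differences
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)   # 95% limits of agreement
    return {"mean_diff": bias, "sd": sd, "loa_95": loa,
            "max_abs_diff": float(np.abs(d).max())}
```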
The mean difference and standard deviation between the two methods are both less than 1.5 mm, which is within an acceptable margin of error. The largest deviation in pupillary distance measurement was observed in the 16th experimental group, potentially due to anomalies during image capture. Nevertheless, this error remains within an acceptable range, which further confirms the close agreement between the two measurement methods.
Employing the measurement data from the i.Terminal2 as the benchmark, we conducted a consistency analysis experiment on our visual intelligent measurement system. The experiment commenced with a correlation assessment to determine the linear relationship between the results derived from both measurement methods. The calculated data revealed a Pearson correlation coefficient of 0.944 for pupillary distance, 0.964 for the left pupillary height, and 0.916 for the right pupillary height. These figures all indicate a robust positive correlation. This finding suggests a highly linear relationship between the measurement outcomes of the visual intelligent measurement system and the i.Terminal2, thereby offering substantial statistical validation for the reliability of its measurement results.
A Bland–Altman analysis was performed to evaluate the concordance between the two measurement techniques. This approach adeptly exposes the extent of measurement discrepancies and any systematic bias by graphing the disparity between the two methods’ results against their mean values. The data presented in
Figure 19a discloses that the 95% limits of agreement for pupillary distance extend from −2.00 mm to 2.70 mm. This indicates that approximately 95% of the measurement discrepancies fall within this range, suggesting a high degree of concordance between the two measurement techniques. Furthermore, as illustrated in
Figure 19b,c, the 95% limits of agreement for the left pupillary height are between −0.84 mm and 1.76 mm, while for the right pupillary height they span from −1.85 mm to 1.79 mm. These findings underscore the significant consistency in pupillary height measurements between our visual intelligent measurement system and the i.Terminal2.
Visual inspection of the graph indicates that the majority of measurement differences are tightly clustered around the mean difference line, with no discernible systematic bias or trend variation associated with the magnitude of the measurement values. This observation further substantiates the visual intelligent measurement system’s strengths in terms of consistency and reliability.
Determining the relative error is a pivotal analytical technique for assessing the precision of measurement data, as it quantifies the proportion of each measurement deviation in relation to a reference value—the i.Terminal2 measurement outcomes, in this case. In our experiment, we computed the relative errors for both the pupillary distance and height measurements.
Figure 20a demonstrates that the intelligent measurement system attained an average relative error of 1.81% in pupillary distance measurements. Regarding pupillary height, as depicted in
Figure 20b,c, the average relative error was 2.51% for the left eye and 2.96% for the right eye. In absolute terms, our mean absolute differences versus i.Terminal2 were 0.73 mm (PH-L), 0.89 mm (PH-R), and 1.13 mm (PD) (
Table 7). Standards for mounted lenses indicate ±1.0 mm tolerance for segment/fitting-cross vertical location (per lens) and limits on unwanted prism of ≤0.67Δ horizontal and ≤0.33Δ vertical, with an additional requirement that the prism reference point (PRP) not deviate by more than 1.0 mm from its specified position (ANSI Z80.1; see also ISO 21987) [
51,
52]. By Prentice’s rule (Δ = F·c, with c in cm), a 0.5 mm monocular centration error (0.05 cm) in a ±3.00 D lens induces ~0.15Δ, and ~0.30Δ at ±6.00 D, both within the above tolerances [
53]. Interpreting our PD difference (1.13 mm) as ~0.56 mm per eye when symmetrically distributed, the implied unwanted prism at typical powers remains within tolerance. Therefore, the observed PH and PD discrepancies are clinically acceptable under standard centration and prism criteria, while we note that higher-powered prescriptions reduce the decentration margin (per Prentice’s rule). Outliers (PD in groups 16 and 29; PH-L in groups 12 and 22) were analyzed for capture artifacts (lighting/gaze/pose) and remain within the 95% Bland–Altman limits (
Figure 19). At the same time, the relative error of this method remains within 3–4% (
Figure 21).
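The prism estimate above can be reproduced directly from Prentice's rule; the short script below applies the rule to the measured PD difference under the symmetric-split assumption (names and the printed report are illustrative).

```python
def prentice_prism(power_diopters, decentration_mm):
    """Unwanted prism (in prism dioptres) induced by decentration, per Prentice's rule."""
    return abs(power_diopters) * (decentration_mm / 10.0)   # decentration converted to cm

# PD difference of 1.13 mm split symmetrically (~0.56 mm per eye),
# checked against the 0.67 prism-dioptre horizontal tolerance (ANSI Z80.1).
for power in (3.00, 6.00):
    prism = prentice_prism(power, 1.13 / 2)
    status = "within" if prism <= 0.67 else "exceeds"
    print(f"{power:.2f} D lens: {prism:.2f} prism dioptres ({status} tolerance)")
```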
Overall, the analysis of the experimental data shows that the visual intelligent measurement system achieves good precision, supporting its practical utility for pupillary height and distance measurement. Future work should concentrate on refining the algorithms and measurement protocols of the intelligent system in order to further reduce measurement error and improve its precision and dependability across diverse scenarios.
4. Discussion
This work presents a vision-based system for the automatic measurement of pupillary height (PH) and pupillary distance (PD) that integrates a two-layer Ensemble of Regression Trees (ERT) for robust landmarking, direction-aware ray casting guided by eye-corner orientation to recover informative iris edges under eyelid occlusion, an edge-side-stratified RANSAC followed by least-squares circle fitting for sub-pixel pupil-center refinement, and a PnP-based head-pose gate to exclude off-frontal frames prior to measurement. Relative to representative eye-center localization baselines (gradient-driven supervised regression, gaze-pipeline heuristics, and Snakuscule active contours), the proposed pipeline differs at three critical stages (coarse seeding, edge evidence gathering, and hypothesis sampling) and produces higher accuracy on BioID under the same NME protocol (e ≤ 0.10: 97.1% vs. 92.2–93.6%; e ≤ 0.05: 85.8% vs. 77.6–85.6%;
Table 3), indicating that the gains arise from algorithmic design rather than additional training alone.
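For reference, the BioID accuracy figures quoted above follow a normalized-error protocol; the sketch below assumes the commonly used worse-eye normalization by the true inter-ocular distance (the exact protocol is defined with Table 3), with illustrative array shapes and names.

```python
import numpy as np

def eye_localization_accuracy(pred, gt, thresholds=(0.05, 0.10)):
    """Fraction of images whose normalized eye-center error e falls below each threshold.

    pred, gt : arrays of shape (N, 2, 2) holding (left, right) eye centers as (x, y).
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    interocular = np.linalg.norm(gt[:, 0] - gt[:, 1], axis=1)   # true inter-ocular distance
    per_eye_err = np.linalg.norm(pred - gt, axis=2)             # (N, 2) pixel errors
    e = per_eye_err.max(axis=1) / interocular                   # worse-eye normalized error
    return {t: float((e <= t).mean()) for t in thresholds}
```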
In this study, robustness refers to (i) stability of the automatic measurements across subjects and repeated captures, (ii) tolerance to moderate variation in head pose and illumination, and (iii) agreement with an external clinical reference. We quantify robustness using Bland–Altman bias and 95% limits of agreement, Pearson correlation, mean absolute error and RMSE against i.Terminal2, as well as cross-dataset landmark accuracy on 300W and WFLW, which include challenging pose, blur, occlusion, and lighting conditions. Importantly, the observed errors are clinically contextualized. The average relative error was 2.51% for the left eye and 2.96% for the right eye. In absolute terms, mean absolute differences versus i.Terminal2 were 0.73 mm (PH-L), 0.89 mm (PH-R), and 1.13 mm (PD) (
Table 7). Tolerances commonly applied to mounted lenses specify ±1.0 mm for segment/fitting-cross vertical location per lens and limits on unwanted prism of approximately ≤0.67Δ (horizontal) and ≤0.33Δ (vertical) at the prism reference point (ANSI Z80.1; ISO 21987). By Prentice's rule (Δ = F·c, c in cm), a 0.5 mm monocular centration error (0.05 cm) in a ±3.00 D lens induces ~0.15Δ and ~0.30Δ at ±6.00 D, both within those limits. Interpreting our PD difference (1.13 mm) as ~0.56 mm per eye when symmetrically distributed, the implied unwanted prism at typical powers remains within tolerance. Therefore, the PH and PD discrepancies observed here are consistent with clinical acceptability under standard centration and prism criteria, although higher-powered prescriptions naturally reduce the decentration margin. Outliers (PD in groups 16 and 29; PH-L in groups 12 and 22) remained within the 95% Bland–Altman limits (
Figure 19) and were traceable to capture artifacts.
The dominant sources of error were (i) suboptimal illumination (noise, overexposure, or flare) that degrades iris–sclera and rim contrast and weakens edge evidence for the ray-casting stage and (ii) residual pose or fixation deviations (slight head tilt or gaze offset) that can bias monocular centration despite pose gating. Retakes with improved lighting and explicit fixation cues reduced the tails of the error distribution without materially changing the mean.
We also recognize potential bias and generalizability limitations. Public landmark datasets used (e.g., 300W, WFLW, BioID) offer pose/expression diversity but lack explicit skin-tone labels and likely under-represent darker skin tones and low-contrast irides, which may affect pupil/iris edge detection and lens-rim segmentation in low-luminance conditions. To mitigate this, future work will expand data collection across Fitzpatrick skin-tone groups and eyewear styles, apply targeted photometric augmentation (luminance/contrast/color jitter) tailored to low-contrast irides, and include external validation on demographically diverse cohorts. Regarding age, the system was not validated in children and is not claimed for pediatric fitting. Pediatric PD varies with age, and cooperation is challenging; planned adaptations include animated fixation targets, short-burst multi-frame capture with temporal stabilization, higher-resolution eye crops for small faces, and pose-gated retakes, with performance reported and stratified by age. Although the pipeline was designed to mitigate glare and contrast loss (via photometric augmentation, edge-aware rim extraction, and pose gating), all quantitative comparisons with the i.Terminal2 were performed under controlled indoor lighting. Therefore, robustness to challenging real-world illumination (e.g., strong reflections, backlighting, outdoor lighting) has not yet been fully validated and remains part of future work.
Practically, the proposed pipeline achieves accuracy comparable to a high-end commercial device while operating with a single camera and commodity compute, enabling a camera-only workflow that may reduce equipment cost and streamline in-store or tele-optometry capture. Pose gating offers an operational safeguard by prompting retakes when Euler angles exceed thresholds. Remaining limitations include dataset demography and pediatric validation, residual sensitivity to lighting and partial occlusion, and the need for stronger shape/edge priors at the rim–skin interface. Future work will explore transformer/self-attention backbones and graph-enhanced shape priors to further suppress edge diffusion, super-resolution and uncertainty-aware circle fitting to stabilize the pupil center under occlusion, and binocular/multi-camera capture with automated, periodic calibration and on-line illumination normalization to improve robustness across environments. Finally, we plan to release a demographically annotated validation set and report stratified metrics (skin tone, age band, eyewear type) to enable transparent assessment of fairness and generalizability.
While the proposed system achieves clinically acceptable agreement with a high-end commercial device using only a single off-the-shelf camera, several limitations remain. First, performance still depends on illumination quality: glare on the spectacle lens, strong reflections, or underexposure can weaken iris–sclera contrast and degrade RC-BlendMask rim extraction, increasing PH/PD variability. Second, although we apply pose gating via PnP and Euler-angle thresholds, residual head tilt or off-axis gaze can bias monocular centration, and the need to discard off-frontal frames may reduce first-pass capture efficiency in realistic retail environments. Third, our validation cohort (30 participants) did not explicitly stratify for skin tone, iris pigmentation, eyewear style, or pediatric patients; as a result, generalizability to darker irides, highly reflective coatings, or uncooperative children has not yet been established. Finally, we have not exhaustively characterized extreme prescriptions (very high diopters or strong wraparound frames), where decentration tolerances tighten.
Future work will expand the demographic and optical diversity of the dataset, incorporate illumination normalization and glare suppression into the capture pipeline, and explore lightweight multi-view or binocular capture to further stabilize depth and rim geometry without significantly increasing system cost. We will also investigate pediatric-specific protocols (shorter capture windows, animated fixation targets) and report stratified performance metrics for fairness and clinical applicability.