Article

Expectation–Maximization Method for RGB-D Camera Calibration with Motion Capture System

1. Faculty of Computer and Software Engineering, Huaiyin Institute of Technology, Huai'an 223001, China
2. Graduate School, Rajamangala University of Technology Thanyaburi, Khlong Luang, Pathum Thani 12110, Thailand
3. Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100049, China
4. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
5. Faculty of Science and Technology, Nepal College of Information Technology, Lalitpur 44600, Nepal
* Author to whom correspondence should be addressed.
Photonics 2026, 13(2), 183; https://doi.org/10.3390/photonics13020183
Submission received: 22 December 2025 / Revised: 29 January 2026 / Accepted: 9 February 2026 / Published: 12 February 2026

Abstract

Camera calibration is an essential research direction in photonics and computer vision. It standardizes camera data through intrinsic and extrinsic parameters. Recently, RGB-D cameras have become important devices by supplementing depth information, and they are commonly divided into three kinds of mechanisms: binocular, structured light, and Time of Flight (ToF). However, the different mechanisms make calibration methods complex and hardly uniform. Issues such as lens distortion, parameter loss, and sensor degradation can even cause calibration to fail. To address these issues, we propose a camera calibration method based on the Expectation–Maximization (EM) algorithm. A unified latent-variable model is established for the different kinds of cameras. In the EM algorithm, the E-step estimates the hidden intrinsic parameters of the cameras, while the M-step learns the distortion parameters of the lens. In addition, the depth values are calculated by a spatial geometric method, and they are calibrated using the least squares method under an optical motion capture system. Experimental results demonstrate that our method can be directly employed in the calibration of monocular and binocular RGB-D cameras, reducing image calibration errors by 0.6–1.2% compared with least squares, Levenberg–Marquardt, Direct Linear Transform, and Trust Region Reflective methods. The depth error is reduced by 16 to 19.3 mm. Therefore, our method can effectively improve the performance of different RGB-D cameras.

1. Introduction

RGB-D cameras have attracted increasing attention and have been widely applied in computer vision tasks, such as object detection [1,2,3], depth estimation [4,5,6], image processing [7,8,9], Structure from Motion (SfM) [10,11], Simultaneous Localization and Mapping (SLAM) [12,13], and so on. As an essential step in standardizing the data quality of RGB-D cameras, camera calibration makes it possible to acquire more accurate images and depth information. RGB-D cameras are developed in various ways, such as monocular depth sensing and binocular stereo vision technology. However, the variety of camera types has produced many calibration methods [14], and these methods are often limited to specific types of cameras in application, which restricts their widespread use. Therefore, RGB-D camera calibration methods are rarely unified. In addition, RGB-D cameras face challenges similar to those of conventional RGB cameras, such as lens distortion, loss of camera intrinsic parameters, and degeneration of electronic components, which result in failures of image data standardization.
RGB-D cameras are primarily implemented through binocular cameras, structured light, and ToF technologies [15]. Binocular cameras reconstruct scene depth by capturing paired RGB images with dual lenses and estimating disparity through stereo matching, typically selecting one view as the RGB image. This approach requires accurate calibration of the dual lenses, and the disparity matching depends upon the texture information in the images. Due to the complex mechanism of binocular camera systems, unified calibration is difficult to achieve. To compensate for the lack of texture information, structured light technology is often integrated to supply texture details missing from the original scene. Depth information can be estimated by projecting structured light patterns and analyzing their deformation [16]. However, structured light can suffer data loss due to specular reflections on smooth surfaces. ToF cameras calculate depth distance by measuring the round-trip time or phase delay of light signals [17,18]. RGB information is usually acquired by an additional monocular camera to form RGB-D data. Despite the advantages of ToF, such as high accuracy and fast response, the technology suffers from poor anti-interference capability, making it prone to losing information at long distances [19]. Based on the above analysis, RGB-D cameras can be divided into two independent parts: depth data and RGB data. The depth channel records the straight-line distance between the lens center and the scene point, while the calibration of RGB image data lacks an effective unified model.
RGB-D cameras face issues similar to those of RGB cameras, where intrinsic parameters are uncertain due to lens distortion, loss of parameters, and degeneration of sensor components [20,21]. To calibrate intrinsic and extrinsic parameters separately, Wu and Zhu [17] proposed a calibration approach that bridges traditional calibration and self-calibration, addressing recalibration after camera movement. To improve the accuracy of binocular camera calibration, Yin et al. [22] proposed to use a single calibration image and a coded planar target, through multi-constraint optimization on the condition of missing camera information. The related research above primarily focuses on fitting parameters based on the pinhole model. The parameters are mainly divided into two independent parts: hardware-related parameters such as focal length, and lens distortion correction parameters. Both are mutually dependent in the calibration process.
From the above analysis, it can be seen that an RGB-D camera can be calibrated with a two-stage model: one stage for the RGB camera and the other for the depth camera. Accordingly, we propose a general RGB-D camera calibration method based on the Expectation–Maximization (EM) and depth correction algorithms, which can be applied consistently across different representative RGB-D cameras without introducing camera-specific modeling or parameter tuning. This design helps alleviate inconsistencies in model parameters across different types of depth cameras. We utilized the EM algorithm to estimate the hardware and distortion parameters separately, achieving the first-stage calibration of the RGB image. In the second stage, the obtained parameters were employed to convert the depth information of the scene, and the depth was calibrated using least squares regression under a motion capture system. This method is suitable for motion capture–based applications such as human pose estimation and behavior analysis. The EM algorithm is adopted because its probabilistic optimization naturally handles incomplete or noisy observations, enabling robust parameter estimation even under real-world noise and minor outliers. The main contributions of this paper include the following:
  • A unified RGB-D camera calibration framework has been established to solve the calibration problem of RGB-D cameras. It can calibrate different types of RGB-D cameras, such as monocular and binocular cameras, in a unified way.
  • A camera calibration method is proposed based on the EM algorithm, which simultaneously calculates hardware parameters and lens distortion parameters. This method can efficiently improve the accuracy of camera calibration.
  • The depth data is calibrated under a motion capture system, transferring the lens’s straight-line depth to spatial depth, which provides a new solution for calibration of depth data.
The remainder of this paper is organized as follows: Section 2 reviews related work on traditional, learning-based, and RGB-D calibration methods; Section 3 presents the equipment, data, and the proposed EM-based RGB calibration and depth correction framework; Section 4 describes the experimental setup and comparative evaluation results; and Section 5 concludes the study and outlines future work.

2. Related Work

2.1. Traditional Universal Camera Calibration Methods

Traditional universal camera calibration methods aim to develop flexible imaging models. Early research [23,24,25,26] focused on the pinhole model [27]. Zhang’s groundbreaking method [28], which is based on the observation of planar patterns, combines closed-form solutions with nonlinear optimization techniques grounded in maximum likelihood criteria, thereby greatly simplifying the camera calibration process. Besides the pinhole model, imaging models also include fisheye models, catadioptric models, and others, leading to calibration methods for highly distorted conditions [23,24,29]. The rational function model [30] has been adopted by OpenCV due to its efficiency and versatility in extreme distortion environments such as fisheye and catadioptric cameras. Sturm and Ramalingam [24] integrated multiple imaging models and proposed a universal calibration method suitable for central, non-central, and axial cameras. Additionally, some studies returned to the essence of optical imaging. Non-parametric methods were directly used to calibrate image distortions. These methods established mapping relationships between positions and pixels, and simplified parameter calculation and calibration steps [31,32]. Grossberg and Nayar [23] introduced the concept of a raxel, a virtual mapping from light rays to pixels, and constructed an ideal optical model. Bergamasco et al. [33] proposed a general unconstrained nonlinear model method. A unified calibration center was enforced for non-central cameras, and the method was achieved by constraining light rays through the optical center. These methods provide flexible and general approaches for image distortion correction but have drawbacks, including high computational cost, limited generalization, and increased memory requirements.

2.2. Learning-Based Camera Calibration Methods

With the development of deep learning, learning-based camera calibration methods have become popular. There are two mainstream approaches: one is incorporating image models into loss functions or network modules to estimate distortion parameters [34,35,36,37,38]. Rong et al. [37] used CNNs to train synthetic image data for radial distortion calibration, proving that distorted images from pinhole camera models can be used to train a CNN to estimate wide-angle lens distortion. The other is learning universal distortion calibration maps or training distortion calibration networks [36]. However, there are also some limitations. For example, image distortions make the problem difficult to represent and describe accurately, and they affect the accuracy of feature extraction and training. Liao et al. [39] addressed this issue by proposing an ordinal distortion estimation method, which improves the model’s ability to represent and learn distortion features by modeling distortion information. Due to the scarcity of distortion annotation data, synthetic images generated by algorithms are often used as substitutes, but these images can only simulate distortions and cannot reflect real distortion characteristics [40].

2.3. RGB-D Camera Calibration

The calibration of RGB-D cameras requires simultaneous consideration of both RGB lens calibration and depth information alignment. For binocular cameras, it is essential to calibrate and match the two camera lenses while accounting for the disparity. Yang et al. [41] proposed a universal calibration algorithm for two uncalibrated images, in which epipolar geometry constraints were utilized to obtain more accurate disparity information. Zhou et al. [42] used chessboards to calibrate under auxiliary infrared lighting. They calculated the intrinsic and extrinsic parameters of the depth camera by Zhang's method, and directly calibrated the distortion of depth maps based on depth correction. This approach is more intuitive and faster than distortion-correction models. However, these methods are tailored to specific types of RGB-D cameras, lacking universal applicability. Some research has explored methods for unifying the calibration of various RGB-D cameras. Basso et al. [43] proposed a two-part error model that unified the error sources of RGB-D pairs. The model can be used for structured light 3D cameras and ToF cameras, which are widely used in robotic scenarios. Ramírez-Hernández et al. [44] used the least squares method to model errors by converting pixel points into angular information to find accurate 3D points. However, these methods have limited error correction capabilities for depth information and fail to accurately recover depth information.

3. Method

3.1. Equipment and Data

Two types of common RGB-D cameras and an optical motion capture system were employed for evaluating our calibration algorithm. The RGB-D cameras were designed and manufactured in different modes; one is the binocular camera with structured light, and the other is the ToF camera. Their common intrinsic camera parameters are presented in Table 1. The optical motion capture system is used to collect the ground truth of position in a 3D space.
Camera 1 (Cam 1): The Intel RealSense D455 (Figure 1) employs binocular vision technology combined with active stereo vision to capture depth information. The paired images are obtained by infrared lenses located on the two sides of the camera, and a depth value is calculated for each pixel of the RGB image. Its depth measurement range spans from 0.6 m to 6 m. The RGB and depth output resolutions were set to 640 × 480, with a capture frame rate of 15 FPS.
Camera 2 (Cam 2): The Orbbec Femto Bolt (Figure 2) utilizes Time-of-Flight (ToF) sensing technology and integrates multiple sensing modules, including a multi-mode depth camera and an RGB video camera. The RGB and depth resolutions were set to 1920 × 1080, with a capture frame rate of 15 FPS.
Motion Capture System: The Optitrack motion capture system provides the positions of marker points within the camera's field of view, thereby supplying the 3D reference space needed for the calibration of depth cameras. The system employs a group of 10 infrared light-sensitive cameras, each with a lens resolution of up to 4.2 million pixels. It offers precise measurements within a spatial area of 5 m × 3.2 m × 2.5 m, with a capture accuracy of 0.1 mm. The system is equipped with software for real-time acquisition of positional data. Figure 3 shows the structural schematic diagram of our system.
Dataset: Each camera collected a dataset consisting of a training set with 27 position points and a testing set with 81 points (Figure 4). The dataset also includes the cameras' extrinsic parameters, used to transform points from their original coordinates to the camera coordinate system. In the training set, the positions of points are distributed at various distances (far, medium, near), heights (top, middle, bottom), and locations (left, middle, right), basically covering the entire 3D acquisition space. In the testing set, points are randomly distributed. During collection, a motion capture marker was attached to a camera stand, and the height and location of the marker were changed by adjusting the stand. When collecting the training set, the marker was moved at regular intervals to cover the entire data collection space. For the test set, we randomly changed the positions of the marker. Once the position of the marker was determined by the motion capture system, the corresponding RGB image and depth map were captured by our collection software. These data are available from the corresponding author via email.
In Figure 5, Rows 1 and 3 display RGB images of three instance points (near, medium, and far) captured by two cameras at the same height (low height) and the same position (middle position), respectively. Row 2 and Row 4 then present the corresponding depth maps for these RGB images. Cam 1, equipped with binocular infrared cameras, has a far measurement range but relatively low precision. It can capture markers and the camera stand, but objects at a greater distance, such as walls, appear blurred depth-wise, failing to exhibit abrupt changes in depth information at object edges. Cam 2, using a Time-of-Flight (ToF) sensor, has a limited measurement range, assigning a depth of 0 to areas beyond this range. However, it boasts high precision, with objects in the depth map differentiated by distinct boundaries.

3.2. EM Algorithm for RGB Camera Calibration

The EM algorithm, as an iterative method for estimating parameters, is primarily used to address the maximum likelihood estimation of parameters in probabilistic models with hidden variables. Ultimately, the iterative process converges both parameter sets to the optimal solution [45]. The EM algorithm consists of an E-step and an M-step. In our study, the Expectation step (E-step) transforms the intrinsic camera parameters, such as focal length and image sensor size, into hidden variables θ_x and θ_y and computes them. The Maximization step (M-step) re-estimates the distortion parameters K using the hidden variables calculated in the E-step.
To place the above procedure into a probabilistic EM framework, a likelihood model is induced by the camera imaging function. For a 3D point P_{c_i} observed as a 2D pixel P_{p_i}, the imaging model provides the predicted pixel location:

$$\hat{P}_{p_i} = f(P_{c_i}; \theta, K), \tag{1}$$

where θ = (θ_x, θ_y) denotes the latent variables related to the intrinsic imaging properties, and K = (k_1, k_2, k_3, k_4) denotes the distortion parameters.
Based on the reprojection consistency principle, the conditional likelihood of observing P_{p_i} given P_{c_i}, θ, and K is defined as a monotonically decreasing function of the reprojection error:

$$p(P_{p_i} \mid P_{c_i}, \theta, K) \propto \exp\left(-\left\|P_{p_i} - f(P_{c_i}; \theta, K)\right\|^2\right). \tag{2}$$
Let $\mathcal{D} = \{(P_{c_i}, P_{p_i})\}_{i=1}^{N}$ denote the observed dataset. By augmenting the observed data with the latent variables θ, the complete-data likelihood can be written as Equation (3):

$$p(\mathcal{D}, \theta \mid K) = p(\theta)\prod_{i=1}^{N} p(P_{p_i} \mid P_{c_i}, \theta, K). \tag{3}$$

In Equation (3), a non-informative prior is assumed for θ. Under this formulation, maximizing the complete-data log-likelihood is equivalent to minimizing the overall reprojection error used in this paper.
The E-step consists of estimating the latent variables θ by maximizing the expected complete-data log-likelihood under the current distortion parameters K^{(t)}. With the likelihood defined by the reprojection error and a non-informative prior on θ, this step reduces to a deterministic optimization problem. Specifically, the E-step update can be written as

$$\theta^{(t+1)} = \arg\min_{\theta}\sum_{i=1}^{N}\left\|P_{p_i} - f(P_{c_i}; \theta, K^{(t)})\right\|^2, \tag{4}$$

which is exactly the objective used for updating θ_x and θ_y in the following E-step derivation.
In our method, the principal point is fixed at (w/2, h/2) because, for engineering stability and parameter identifiability, joint optimization with other parameters offers little accuracy gain and may reduce numerical stability.

3.2.1. Expectation Step (E-Step)

The camera projection can be described as a linear perspective projection. For the 3D position points and 2D pixel points $i = 1, 2, 3, \ldots, n$, the projection is given by Equation (5):

$$\begin{bmatrix} x_u \\ y_u \end{bmatrix} = \begin{bmatrix} \dfrac{x_c}{z_c}\,f_x + \dfrac{w}{2} \\[6pt] \dfrac{y_c}{z_c}\,f_y + \dfrac{h}{2} \end{bmatrix} \tag{5}$$
Camera intrinsic parameters include the lens focal lengths f_x, f_y, the sensor size parameters s_x, s_y, and the image width and height w, h. However, when the camera intrinsic parameters are unknown, Equation (5) is hard to employ. To solve this issue and enable joint calculation of the distortion parameters and intrinsic camera parameters, we further transformed f_x, f_y, s_x, s_y, w, and h into a set of hidden variables θ_x and θ_y. Thus, the transformation between P_{u_i} and P_{c_i} can be revised as Equation (6):
$$\begin{bmatrix} x_{u_i} \\ y_{u_i} \end{bmatrix} = \begin{bmatrix} \dfrac{x_{c_i}}{z_{c_i}}\,\theta_x + \dfrac{w}{2} \\[6pt] \dfrac{y_{c_i}}{z_{c_i}}\,\theta_y + \dfrac{h}{2} \end{bmatrix} \tag{6}$$
Due to lens distortion, there exists a deviation between the ideal pixel point P_{u_i} and the actual pixel point P_{p_i}. The transformation between P_{u_i} and P_{p_i} is described by Equation (7):
$$\begin{bmatrix} x_{p_i} \\ y_{p_i} \end{bmatrix} = \begin{bmatrix} \left(k_1 + k_3 r_i^2\right) x_{u_i} \\ \left(k_2 + k_4 r_i^2\right) y_{u_i} \end{bmatrix} \tag{7}$$
Here, k_1, k_2, k_3, and k_4 are the radial distortion parameters, and r_i denotes the central radial distance, defined as the Euclidean distance from the pixel point P_{u_i} to the distortion center (w/2, h/2). This model differs from the conventional one in that it uses a low-order, pixel-coordinate parameterization to reduce the number of parameters and improve optimization stability; the standard normalized-plane model is also applicable and yields similar calibration accuracy in our scenario. Substituting Equation (6) into Equation (7) yields the transformation relationship between the 3D position points P_{c_i} and the real pixel points P_{p_i}, as shown in Equation (8):
$$\begin{bmatrix} x_{p_i} \\ y_{p_i} \end{bmatrix} = \begin{bmatrix} \left(k_1 + k_3 r_i^2\right)\left(\dfrac{x_{c_i}}{z_{c_i}}\,\theta_x + \dfrac{w}{2}\right) \\[6pt] \left(k_2 + k_4 r_i^2\right)\left(\dfrac{y_{c_i}}{z_{c_i}}\,\theta_y + \dfrac{h}{2}\right) \end{bmatrix} \tag{8}$$
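As a concrete illustration, the forward model of Equations (6)–(8) can be sketched in a few lines of NumPy. This is our own minimal sketch: the function name `project` and the array layout are illustrative choices, not part of the paper.

```python
import numpy as np

def project(points_c, theta_x, theta_y, k, w, h):
    """Forward imaging model of Eq. (8): 3D camera-frame points -> distorted pixels.

    points_c : (n, 3) array of (x_c, y_c, z_c) coordinates.
    k        : distortion parameters (k1, k2, k3, k4).
    """
    k1, k2, k3, k4 = k
    xc, yc, zc = points_c[:, 0], points_c[:, 1], points_c[:, 2]
    # Ideal (undistorted) pixel coordinates, Eq. (6)
    xu = xc / zc * theta_x + w / 2
    yu = yc / zc * theta_y + h / 2
    # Squared radial distance of each ideal pixel from the center (w/2, h/2)
    r2 = (xu - w / 2) ** 2 + (yu - h / 2) ** 2
    # Radial distortion applied in pixel coordinates, Eq. (7)
    xp = (k1 + k3 * r2) * xu
    yp = (k2 + k4 * r2) * yu
    return np.stack([xp, yp], axis=1)
```

With k = (1, 1, 0, 0) the model degenerates to the ideal pinhole projection of Equation (6).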
The parameters θ_x and θ_y can be calculated using n points, as shown in Equation (9):

$$\begin{bmatrix} \theta_x \\ \theta_y \end{bmatrix} = \begin{bmatrix} \dfrac{1}{n}\displaystyle\sum_{i=1}^{n}\left(\dfrac{x_{p_i}}{k_1 + k_3 r_i^2} - \dfrac{w}{2}\right)\dfrac{z_{c_i}}{x_{c_i}} \\[10pt] \dfrac{1}{n}\displaystyle\sum_{i=1}^{n}\left(\dfrac{y_{p_i}}{k_2 + k_4 r_i^2} - \dfrac{h}{2}\right)\dfrac{z_{c_i}}{y_{c_i}} \end{bmatrix} \tag{9}$$
Since the radial distortion parameters are unknown, the initial values of the latent variables θ_x and θ_y can be obtained through random initialization or an approximate solution, as shown in Equation (10):

$$\begin{bmatrix} \theta_x \\ \theta_y \end{bmatrix} = \begin{bmatrix} \dfrac{1}{n}\displaystyle\sum_{i=1}^{n}\left(x_{p_i} - \dfrac{w}{2}\right)\dfrac{z_{c_i}}{x_{c_i}} \\[10pt] \dfrac{1}{n}\displaystyle\sum_{i=1}^{n}\left(y_{p_i} - \dfrac{h}{2}\right)\dfrac{z_{c_i}}{y_{c_i}} \end{bmatrix} \tag{10}$$
The complete process of the E-step algorithm is presented in Table 2.
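A minimal sketch of the E-step update of Equation (9) might look as follows. This is an illustrative implementation under our own naming; the radial distance is approximated from the observed pixels here, and with k = (1, 1, 0, 0) the update reduces to the initialization of Equation (10).

```python
import numpy as np

def e_step(points_c, pixels, k, w, h):
    """E-step of Eq. (9): re-estimate the latent intrinsics (theta_x, theta_y)
    given the current distortion parameters k = (k1, k2, k3, k4).

    points_c : (n, 3) 3D points in camera coordinates (x_c, y_c, z_c),
               assumed to have nonzero x_c and y_c.
    pixels   : (n, 2) observed pixel points (x_p, y_p).
    """
    k1, k2, k3, k4 = k
    xc, yc, zc = points_c[:, 0], points_c[:, 1], points_c[:, 2]
    xp, yp = pixels[:, 0], pixels[:, 1]
    # Squared radial distance of the observed pixels from the center (w/2, h/2)
    r2 = (xp - w / 2) ** 2 + (yp - h / 2) ** 2
    # Averaged closed-form update of Eq. (9)
    theta_x = np.mean((xp / (k1 + k3 * r2) - w / 2) * zc / xc)
    theta_y = np.mean((yp / (k2 + k4 * r2) - h / 2) * zc / yc)
    return theta_x, theta_y
```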

3.2.2. Maximization Step (M-Step)

Based on the calculation of the E-step, we obtain the intrinsic latent variable parameters. The distortion parameters are then iteratively updated in the M-step. The radial distortion parameters comprise k_x = (k_1, k_3) and k_y = (k_2, k_4), which quantify the degree of distortion along the x-axis and y-axis directions of the image, respectively. To improve computational efficiency, the M-step introduces matrices A_x, A_y and vectors p_x, p_y to estimate the parameters, as shown in Equation (11):
$$A_x = \begin{bmatrix} x_{c_1} & r_1 \\ x_{c_2} & r_2 \\ \vdots & \vdots \\ x_{c_n} & r_n \end{bmatrix},\quad A_y = \begin{bmatrix} y_{c_1} & r_1 \\ y_{c_2} & r_2 \\ \vdots & \vdots \\ y_{c_n} & r_n \end{bmatrix},\quad p_x = \begin{bmatrix} x_{p_1} \\ x_{p_2} \\ \vdots \\ x_{p_n} \end{bmatrix},\quad p_y = \begin{bmatrix} y_{p_1} \\ y_{p_2} \\ \vdots \\ y_{p_n} \end{bmatrix} \tag{11}$$
The matrices A_x and A_y are constructed from the radial distances r_i and the position points P_{c_i}, while p_x and p_y are the vectors of x and y coordinates of the n pixel points. The distortion parameters k_x and k_y are calculated using θ_x, θ_y, A_x^{-1}, A_y^{-1}, p_x, p_y, w, and h, as shown in Equation (12):
$$\begin{bmatrix} k_x \\ k_y \end{bmatrix} = \begin{bmatrix} A_x^{-1}\left(p_x - \dfrac{w}{2}\right)\dfrac{\theta_x}{w} \\[8pt] A_y^{-1}\left(p_y - \dfrac{h}{2}\right)\dfrac{\theta_y}{h} \end{bmatrix} \tag{12}$$
The matrices A x 1 and A y 1 denote the pseudo-inverses of A x and A y , whose dimensions are 2 × n , while the vectors p x and p y are n-dimensional. The complete process of the Maximization step algorithm is presented in Table 3.
The computational complexity of the proposed EM procedure is linear in the number of observations. In the Expectation step, the dominant cost arises from iteratively computing radial distances and updating distortion parameters for all points, resulting in a time complexity of O(T·n) with a space complexity of O(n). The Maximization step constructs a small linear system from the same set of observations, leading to a time complexity of O(n) and a space requirement of O(n). Consequently, the overall EM calibration process exhibits efficient linear scalability while maintaining modest memory usage.

3.3. Depth Correction

The depth of a position point captured by an RGB-D camera is the straight-line distance between the depth sensor and the point. The relationship can be represented by the Euclidean distance, as shown in Equation (13):

$$d = \sqrt{(x - x_0)^2 + (y - y_0)^2 + (z - z_0)^2} \tag{13}$$
where (x, y, z) represents the position point in the motion capture system, and (x_0, y_0, z_0) represents the position of the camera, recorded by the motion capture system as the extrinsic parameters. The variable d denotes the straight-line distance (depth) from the camera to the position point.
The coordinates (x, y) can be calibrated by utilizing the parameters obtained from the EM algorithm, as shown in Equation (14):

$$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \left(\dfrac{x_{c_i}}{k_1 + k_3 r^2} - \dfrac{w}{2}\right)\dfrac{z_{c_i}}{\theta_x} \\[8pt] \left(\dfrac{y_{c_i}}{k_2 + k_4 r^2} - \dfrac{h}{2}\right)\dfrac{z_{c_i}}{\theta_y} \end{bmatrix} \tag{14}$$
The remaining value z in Equation (13) can be calculated using Equation (15). Since the camera faces the forward direction, the depth value z within the camera's field of view should be smaller than z_0; therefore, the negative sign of the square root is selected:

$$z = z_0 - \sqrt{d^2 - (x - x_0)^2 - (y - y_0)^2} \tag{15}$$
After transforming the straight-line distances into depth coordinates, we further employ the least-squares method to calibrate the depth values in the motion capture system, as shown in Equation (16):

$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} w_x x + b_x \\ w_y y + b_y \\ w_z z + b_z \end{bmatrix} \tag{16}$$
where (x, y, z) represent the back-projected positions of the points obtained from Equations (14) and (15), while (X, Y, Z) denote the original positions in the motion capture system. The parameters w_x, w_y, w_z, b_x, b_y, and b_z are the per-axis weight and bias parameters.
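The depth-correction chain of Equations (15) and (16) can be sketched as follows, assuming the camera position and the straight-line depths are given; the function names are our own illustrative choices.

```python
import numpy as np

def back_project_depth(d, x, y, cam_pos):
    """Recover z from the straight-line depth d via Eq. (15). The negative
    root is taken because scene points lie in front of the camera (z < z0)."""
    x0, y0, z0 = cam_pos
    return z0 - np.sqrt(d**2 - (x - x0)**2 - (y - y0)**2)

def fit_depth_calibration(p, P):
    """Per-axis linear least squares of Eq. (16): P ~= w * p + b.

    p : (n, 3) back-projected points; P : (n, 3) motion-capture ground truth.
    Returns the weight and bias arrays (w_x, w_y, w_z) and (b_x, b_y, b_z).
    """
    ws, bs = [], []
    for axis in range(3):
        # Design matrix [p_axis, 1] for the affine fit along one axis
        A = np.column_stack([p[:, axis], np.ones(len(p))])
        (w_a, b_a), *_ = np.linalg.lstsq(A, P[:, axis], rcond=None)
        ws.append(w_a)
        bs.append(b_a)
    return np.array(ws), np.array(bs)
```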
Figure 6 shows our method's overall calibration workflow. The calibration process can be divided into two parts: the first applies the EM algorithm to calibrate the 2D pixel points, and the second transforms and calibrates the depth information.

4. Results and Discussion

The experiment was conducted in two parts: (1) The pixel calibration in RGB images—the EM algorithm was compared with other similar methods to assess its feasibility and accuracy. In addition, we studied the impact of different initialization methods on the results of the EM algorithm. (2) The depth value calibration: The depth values were transformed and calibrated corresponding to the pixel points.

4.1. Evaluation Criteria

The residual between the calibrated pixel points and the real pixel points is used as a metric for evaluating calibration accuracy. Two equations are employed to evaluate the performance. For qualitative analysis, we calculate the relative pixel points p_relative by subtracting the original pixel points from the calibrated ones, as shown in Equation (17). For quantitative analysis, the Root Mean Square Error (RMSE) is adopted to compute the residual, as shown in Equation (18).
$$p_{\mathrm{relative}} = p_{\mathrm{inference}} - p_{\mathrm{real}} \tag{17}$$

$$\mathrm{RMSE} = \sqrt{\dfrac{1}{m}\sum_{i=1}^{m}\left\|p_{\mathrm{inference},i} - p_{\mathrm{real},i}\right\|^2} \tag{18}$$

Here, p_real represents the original pixel points, and p_inference represents the calibrated pixel points.
For depth error measurement, the average absolute error is used to evaluate the calibration accuracy, as shown in Equation (19):

$$\mathrm{Error}_{\mathrm{depth}} = \left|D_{\mathrm{gt}} - D\right| \tag{19}$$

Here, D_gt denotes the ground-truth depth, and D represents the estimated depth.
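The two metrics of Equations (18) and (19) can be written directly in NumPy; the function names are illustrative.

```python
import numpy as np

def rmse(p_inference, p_real):
    """Eq. (18): root-mean-square Euclidean residual over m pixel points.
    Inputs are (m, 2) arrays of pixel coordinates."""
    return np.sqrt(np.mean(np.sum((p_inference - p_real) ** 2, axis=1)))

def depth_error(d_gt, d_est):
    """Eq. (19), averaged over points: mean absolute depth error in mm."""
    return np.mean(np.abs(d_gt - d_est))
```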

4.2. Pixel Point Calibration for RGB Image

To validate the effectiveness of the Expectation–Maximization (EM) algorithm on the two cameras, we compared it with the Direct Linear Transformation (DLT) [46], Linear Least Squares (LS) [47], Levenberg–Marquardt (LM) [48], and Trust Region Reflective (TRF) [49] algorithms. Among these methods, the Levenberg–Marquardt (LM) algorithm is commonly used in bundle adjustment (BA) [50] and related optimization-based calibration approaches. In our experiments, the triangulation step in BA is omitted, and the optimization is performed directly using the 3D marker positions provided by the motion capture system. The dataset is divided into 27 training points and 81 testing points, yielding a roughly 1:3 ratio. This setting is adopted to more clearly distinguish the performance differences among various methods under a limited number of markers. In practical engineering applications, a larger number of markers is typically used.
To ensure fair comparison, all methods were evaluated under identical experimental settings. For the EM algorithm, the iteration limit was set to 10 with a convergence tolerance of 1 × 10⁻⁶, using standard pinhole estimates for intrinsic initialization and zero initialization for distortion parameters. The optimization-based baselines (LS, LM, TRF) adopted the same initialization, each with a maximum of 20 iterations. All experiments used the same data split, identical image resolution and preprocessing, and independent evaluation on the two cameras. Runtime was measured under single-thread CPU execution for consistency. The results are shown in Figure 7; all the pixel points in the dataset are displayed in the plots.
In Figure 7, from top to bottom, the methods are DLT (blue), LS (pink), LM (orange), TRF (purple), and our EM algorithm (red). Among these, LS, LM, TRF, and our EM algorithm are iterative. In the training set (rows 1 and 2), the results of DLT, LS, and EM are similar, and the calibrated pixel points are close to the real pixel points. The LM and TRF methods are relatively inferior, with significant residuals between the calibrated and actual points. In the test set (rows 3 and 4), DLT and EM continue to maintain small residuals; the performance of LS begins to decline, and the residuals of LM and TRF increase further.
In Figure 8, scatter points closer to the center (0, 0) indicate smaller errors; the distance between adjacent concentric circles is 1 pixel. By comparison, the pixels of our EM algorithm (red) lie closer to the center than those of the other methods, and their errors in the x-axis direction are smaller than those in the y-axis direction. The results obtained by the DLT and LS methods are similar, but both are inferior to our method: their pixel points surround those obtained by our method, with an error of approximately one pixel. The pixel points of LM and TRF are relatively poor, with some points falling outside the range of the comparison. We counted the number of points with errors under 2, 4, and 6 pixels for the different methods, as shown in Table 4.
The error reported in Table 4 is the pixel reprojection error, defined as the Euclidean distance between the corrected pixel locations and the ground-truth image observations, as formulated in Equation (18). It should be noted that RMSE is a pixel-domain metric and is sensitive to image resolution; therefore, the absolute RMSE values are not strictly comparable across cameras with different resolutions. In Cam 1, all 27 pixel points calibrated by EM exhibit errors under 4 pixels, outperforming DLT (26), LS (23), LM (21), and TRF (22). In Cam 2, due to its higher resolution, the average error is also higher than in Cam 1; out of the 27 points, 26 have errors under 6 pixels, better than DLT (22), LS (20), LM (16), and TRF (17). Under identical experimental conditions, the error of the points obtained through our method is significantly smaller than that of the other methods. Moreover, in our experiments, linear methods such as LS outperform nonlinear ones because, although the RGB images were uncorrected, factory calibration of the RGB-D cameras ensures that the images have minimal distortion, making the imaging model approximately linear and allowing LS to perform well under these conditions.
Table 5 presents the overall error of each method on the two cameras, calculated by Equation (18). The RMSE of the EM algorithm on the datasets from Cam 1 and Cam 2 is 1.26 pixels and 3.22 pixels, respectively. Within the same camera, these errors are significantly better than those of the other methods. The results for LS are 1.47 (Cam 1) and 4.73 (Cam 2), and for DLT 1.26 (Cam 1) and 4.02 (Cam 2). Theoretically, since DLT does not take distortion information into account, its accuracy should be inferior to the other methods; however, the results indicate that its performance is comparable to LS and even surpasses LS in Cam 1. The RMSE of the LM and TRF algorithms exhibits substantial discrepancies compared to our method, with differences exceeding 10 pixels. The runtime of the EM algorithm is 0.007 s over 10 iterations, outperforming the LM (1.802 s, 1.871 s), LS (1.494 s, 0.083 s), and TRF (1.011 s, 0.370 s) methods. Our method's running speed is the fastest among the iterative methods (LS, LM, and TRF), second only to the non-iterative DLT algorithm (0.001 s, 0.001 s). In general, the experimental results indicate that the EM algorithm exhibits the best overall performance among the compared methods for both cameras.

4.3. Initialization of Latent Variable Parameters

The EM algorithm is iterative and highly sensitive to its initial parameters. To validate the effect of initialization, we compared two strategies: (1) random initialization, which draws 10 random numbers in the range −10 to 10; and (2) initialization based on prior information, which uses the training data to compute the parameters θ_x, θ_y by Equation (10).
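The two strategies can be sketched as follows (a minimal illustration; variable names are ours, and the prior-based formula is assumed to match the initialization step of Table 2):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_init(n=10, low=-10.0, high=10.0):
    """Strategy 1: n random starting values drawn uniformly from [-10, 10]."""
    return rng.uniform(low, high, size=n)

def prior_init(xp, yp, xc, yc, zc, w, h):
    """Strategy 2: prior-based initialization, averaging per-point
    pinhole estimates theta = (pixel - image center) * z_c / coordinate."""
    theta_x = np.mean((xp - w / 2) * zc / xc)
    theta_y = np.mean((yp - h / 2) * zc / yc)
    return float(theta_x), float(theta_y)
```

On noise-free synthetic projections generated from a known focal parameter, `prior_init` recovers that parameter exactly, which is consistent with the faster convergence observed for method 2.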
Figure 9 shows the iteration process of the two methods on the training data of Cam 1 (a) and Cam 2 (b). In method 1 (blue), the 10 random starting values in [−10, 10] converged to −0.032 (Cam 1) and −0.064 (Cam 2) after 20 epochs. In method 2 (red), the initial values converged to −0.031 (Cam 1) and −0.062 (Cam 2). The experimental data indicate that both methods ultimately converge to highly similar results regardless of the initial setting, with a discrepancy within 0.1%. However, for the same number of epochs, the prior-based initialization of method 2 converges faster.

4.4. Depth Calibration

The straight-line (Euclidean) depth values of the RGB-D cameras were obtained through the conversion in Equation (15). Compared with the original depths, the calibrated depth values are closer to the ground truth provided by the motion capture system. The calibration results are illustrated in Figure 10 and Figure 11.
Figure 10 shows the ground truth (green), original depth (red), and calibrated depth (blue) of all 27 position points in the training set for Cam 1 and Cam 2. Rows 1, 2, and 3 display the depth calibration results for points at close, medium, and long distances, respectively. At close range (1300 to 1800 mm in Cam 1), our method is accurate for both cameras: all calibrated points lie close to the ground truth with errors below 8 mm. At medium range (2000 to 2700 mm in Cam 1 and 3000 to 3700 mm in Cam 2), the results are also accurate, with errors of at most 14 to 20 mm. At long range (3300 to 3800 mm in Cam 1 and 4500 to 5000 mm in Cam 2), some points are over-calibrated, but most are effectively calibrated. These training-set results indicate that the depths of most points are closer to the ground truth after calibration. For the test set, we summarize the results of all 81 points in a box plot to analyze their scale and distribution, as shown in Figure 11.
In Figure 11, before calibration the average depth error for Cam 1 ranges from −9.8 mm to 80.3 mm (a spread of 90.1 mm); after calibration it ranges from −22.5 mm to 34.3 mm (a spread of 56.8 mm). For Cam 2, the depth error before calibration ranges from −1.5 mm to 108.3 mm (a spread of 109.8 mm); after calibration it ranges from −22.5 mm to 51.3 mm (a spread of 73.8 mm). The average absolute error dropped from 27.0 mm to 11.0 mm for Cam 1 and from 41.0 mm to 21.7 mm for Cam 2.
Our method uses least squares to model the linear relationship between calibrated and original depths of the position points. The resulting parameters are applied to transform the original depth maps accordingly. Depth information is visualized using a rainbow colormap. Pixel colors transition from red to blue, representing depths from minimum to maximum. The result is shown in Figure 12.
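The linear depth model can be fitted and applied as in the following sketch (an assumed implementation: the function names are ours, and `np.polyfit` stands in for the paper's least-squares step):

```python
import numpy as np

def fit_depth_calibration(original_mm, ground_truth_mm):
    """Least-squares fit of d_true ~ a * d_orig + b over the training points."""
    a, b = np.polyfit(original_mm, ground_truth_mm, deg=1)
    return float(a), float(b)

def calibrate_depth_map(depth_map_mm, a, b):
    """Apply the fitted linear model pixel-wise to a raw depth map (mm)."""
    return a * np.asarray(depth_map_mm, dtype=float) + b

# Hypothetical training points (mm): original camera depths vs. mocap truth.
orig = np.array([1500.0, 2500.0, 3500.0, 4500.0])
true = 1.02 * orig - 5.0
a, b = fit_depth_calibration(orig, true)
```

The fitted pair (a, b) is then applied to every pixel of the raw depth map, which is how the recolored maps in Figure 12 are produced.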
Figure 12 shows the calibration results of Cam 1 and Cam 2. (a) presents the original RGB images of both cameras, (b) displays the original depth maps as well as their overlays with RGB channels, and (c) shows the depth maps after depth calibration, also overlaid with their RGB channels. The results indicate that the depth accuracy of both cameras has been improved through depth-to-distance conversion and least squares regression, with Cam 1 experiencing a more significant enhancement.
Figure 13 shows a comparison of local details in the depth maps of Cam 1 before and after calibration. We selected a section of the depth maps and zoomed in for comparison. In the original depth map, we observed that there was no clear edge between the background wall and the object (an air conditioner visible in the overlay image). However, after calibration, a more distinct depth discontinuity at the edge became apparent. Furthermore, in the original depth, the wall appeared yellow in Cam 1’s image but green in Cam 2’s. After calibration, the results from both cameras became more consistent (the depth of the wall in Cam 1’s image changed to green). These observations confirm that our calibrated depth data reflect spatial depth better.

4.5. Discussion

In this study, the proposed EM-based calibration method is compared with DLT, LS, LM, and TRF. The results show that the proposed method converges more stably when high-precision three-dimensional ground-truth marker points are available. It also achieves lower RMSE and depth error than the compared methods. These results indicate that incorporating camera parameters and three-dimensional point information from the motion capture system into a unified probabilistic model is effective. Optimization within the EM framework enables reliable parameter estimation in multi-type camera environments.
In addition, although some methods are not included in the experimental comparison, they are worth discussing, such as three-dimensional measurement methods based on simple calibration targets and calibration approaches based on structured light [16,17,22]. These methods are generally easy to operate, technically mature, and tailored to particular types of cameras. In contrast, the proposed method is better suited to experimental environments that already employ optical motion capture systems, such as three-dimensional human pose estimation or multi-view motion analysis. In such scenarios, the motion capture system directly provides high-precision three-dimensional marker data, allowing camera calibration to be performed without additional calibration objects. Moreover, the present work is inspired by several studies that also leverage motion capture systems, in which the motion capture framework contributes to improved calibration or measurement accuracy [51,52].

5. Conclusions

Research on the application of RGB-D cameras is significant for advancing computer vision technology, and it involves the calibration of both the RGB camera and the depth information. To address this issue, we have proposed a calibration method suitable for multiple common RGB-D camera configurations: the Expectation–Maximization algorithm is introduced, and spatial depth is obtained with a motion capture system. During calibration, the EM method employs a learning model to swiftly determine the camera's intrinsic parameters, and the depth correction improves depth measurements by an average of 16.0 to 19.3 mm while enabling a rapid conversion of depth to Euclidean distance. However, the adopted linear depth-fitting model is mainly suitable for motion-capture-based scenarios with controlled conditions; outdoor or unconstrained environments may introduce additional interference. Despite this limitation, the proposed calibration method is rapid and efficient and performs well on currently mainstream commercial depth cameras. Future work will focus on extending the applicability of the proposed method to broader scenarios and improving the generalization of depth calibration.

Author Contributions

Conceptualization, J.L. and G.D.; methodology, J.L.; algorithm design, J.L. and G.D.; experimental design, J.L., G.D. and Q.X.; software, J.L. and G.D.; validation, Y.Z. (Yugui Zhang), Y.Z. (Yiyan Zhao) and J.Y.; formal analysis, G.D., Y.Z. (Yugui Zhang) and A.K.; investigation, G.D. and J.Y.; resources, Q.X.; data curation, G.D., Y.Z. (Yiyan Zhao) and A.K.; writing—original draft preparation, J.L. and G.D.; writing—review and editing, J.L., G.D. and Y.Z. (Yugui Zhang); supervision, J.Y. and A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research and the APC were supported by the Special Funds of National Natural Science Foundation of China (No. 62341118), the Joint Funds of the National Natural Science Foundation of China (No. U24A203185), and the Nantong Youth Fund Project for Natural Science Research Projects (No. JCZ2024017).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DLT      Direct Linear Transform
BA       Bundle Adjustment
EM       Expectation–Maximization
LM       Levenberg–Marquardt
LS       Least Squares
RGB-D    Red–Green–Blue and Depth
ToF      Time of Flight
TRF      Trust Region Reflection

References

  1. Wu, Y.; Wang, Y.; Zhang, S.; Ogai, H. Deep 3D object detection networks using LiDAR data: A review. IEEE Sens. J. 2020, 21, 1152–1171. [Google Scholar] [CrossRef]
  2. Liu, F.; Chen, D.; Zhou, J.; Xu, F. A review of driver fatigue detection and its advances on the use of RGB-D camera and deep learning. Eng. Appl. Artif. Intell. 2022, 116, 105399. [Google Scholar] [CrossRef]
  3. Jiao, Y.; Jie, Z.; Chen, S.; Chen, J.; Ma, L.; Jiang, Y.G. MSMD Fusion: Fusing LiDAR and camera at multiple scales with multi-depth seeds for 3D object detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: New York, NY, USA, 2023; pp. 21643–21652. [Google Scholar]
  4. Poggi, M.; Tosi, F.; Batsos, K.; Mordohai, P.; Mattoccia, S. On the synergies between machine learning and binocular stereo for depth estimation from images: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5314–5334. [Google Scholar] [CrossRef] [PubMed]
  5. Xu, Y.; Yang, X.; Yu, Y.; Jia, W.; Chu, Z.; Guo, Y. Depth estimation by combining binocular stereo and monocular structured-light. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 1746–1755. [Google Scholar]
  6. Li, Y.; Liu, X.; Dong, W.; Zhou, H.; Bao, H.; Zhang, G.; Zhang, Y.; Cui, Z. DeltaR: Depth estimation from a lightweight ToF sensor and RGB image. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 619–636. [Google Scholar]
  7. Barreto, J.; Roquette, J.; Sturm, P.; Fonseca, F. Automatic camera calibration applied to medical endoscopy. In Proceedings of the BMVC 2009—20th British Machine Vision Conference, London, UK, 7–10 September 2009; HAL: Milwaukee, WI, USA, 2009; pp. 1–10. [Google Scholar]
  8. Yadav, N.K.; Saraswat, M. A novel fuzzy clustering based method for image segmentation in RGB-D images. Eng. Appl. Artif. Intell. 2022, 111, 104709. [Google Scholar] [CrossRef]
  9. Lin, J.; Gu, Y.; Du, G.; Qu, G.; Chen, X.; Zhang, Y.; Gao, S.; Liu, Z.; Gunasekaran, N. 2D/3D image morphing technology from traditional to modern: A survey. Inf. Fusion 2024, 117, 102913. [Google Scholar] [CrossRef]
  10. Peng, Y.; Yang, M.; Zhao, G.; Cao, G. Binocular-vision-based structure from motion for 3-D reconstruction of plants. IEEE Geosci. Remote Sens. Lett. 2021, 19, 8019505. [Google Scholar] [CrossRef]
  11. Massaro, G. Assessing the 3D resolution of refocused correlation plenoptic images using a general-purpose image quality estimator. Eur. Phys. J. Plus 2024, 139, 727. [Google Scholar] [CrossRef]
  12. Castaneda, V.; Mateus, D.; Navab, N. SLAM combining ToF and high-resolution cameras. In Proceedings of the 2011 IEEE Workshop on Applications of Computer Vision (WACV), Kona, HI, USA, 5–7 January 2011; IEEE: New York, NY, USA, 2011; pp. 672–678. [Google Scholar]
  13. Endres, F.; Hess, J.; Sturm, J.; Cremers, D.; Burgard, W. 3-D mapping with an RGB-D camera. IEEE Trans. Robot. 2013, 30, 177–187. [Google Scholar] [CrossRef]
  14. Song, L.; Wu, W.; Guo, J.; Li, X. Survey on camera calibration technique. In Proceedings of the 2013 5th International Conference on Intelligent Human-Machine Systems and Cybernetics, Hangzhou, China, 26–27 August 2013; IEEE: New York, NY, USA, 2013; Volume 2, pp. 389–392. [Google Scholar]
  15. Chen, D.; Huang, T.; Song, Z.; Deng, S.; Jia, T. AGG-Net: Attention-guided gated-convolutional network for depth image completion. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; IEEE: New York, NY, USA, 2023; pp. 8853–8862. [Google Scholar]
  16. Li, B.; Xu, Z.; Gao, F.; Cao, Y.; Dong, Q. 3D reconstruction of high reflective welding surface based on binocular structured light stereo vision. Machines 2022, 10, 159. [Google Scholar] [CrossRef]
  17. Wu, L.; Zhu, B. Binocular stereovision camera calibration. In Proceedings of the 2015 IEEE International Conference on Mechatronics and Automation (ICMA), Beijing, China, 2–5 August 2015; IEEE: New York, NY, USA, 2015; pp. 2638–2642. [Google Scholar]
  18. Wang, T.L.; Ao, L.; Zheng, J.; Sun, Z.B. Reconstructing Depth Images for Time-of-Flight Cameras Based on Second-Order Correlation Functions. Photonics 2023, 10, 1223. [Google Scholar] [CrossRef]
  19. Qiao, X.; Poggi, M.; Deng, P.; Wei, H.; Ge, C.; Mattoccia, S. RGB-guided ToF imaging system: A survey of deep learning-based methods. Int. J. Comput. Vis. 2024, 132, 4954–4991. [Google Scholar] [CrossRef]
  20. Cai, Z.; Han, J.; Liu, L.; Shao, L. RGB-D datasets using Microsoft Kinect or similar sensors: A survey. Multimed. Tools Appl. 2017, 76, 4313–4355. [Google Scholar] [CrossRef]
  21. Liu, Q.; Sun, X.; Peng, Y. A Distortion Image Correction Method for Wide-Angle Cameras Based on Track Visual Detection. Photonics 2025, 12, 767. [Google Scholar] [CrossRef]
  22. Yin, Y.; Zhu, H.; Yang, P.; Yang, Z.; Liu, K.; Fu, H. High-precision and rapid binocular camera calibration method using a single image per camera. Opt. Express 2022, 30, 18781–18799. [Google Scholar] [CrossRef]
  23. Grossberg, M.D.; Nayar, S.K. A general imaging model and a method for finding its parameters. In Proceedings of the Eighth IEEE International Conference on Computer Vision. ICCV 2001, Vancouver, BC, Canada, 7–14 July 2001; IEEE: New York, NY, USA, 2001; Volume 2, pp. 108–115. [Google Scholar]
  24. Sturm, P.; Ramalingam, S. A generic concept for camera calibration. In Proceedings of the 8th European Conference on Computer Vision (ECCV ’04), Prague, Czech Republic, 11–14 May 2004; HAL: Milwaukee, WI, USA, 2004; pp. 1–13. [Google Scholar]
  25. Yu, J.; McMillan, L. General linear cameras. In Proceedings of the Computer Vision—ECCV 2004, Prague, Czech Republic, 11–14 May 2004; Springer: Berlin/Heidelberg, Germany, 2004; pp. 14–27. [Google Scholar]
  26. Strobl, K.H.; Hirzinger, G. More accurate pinhole camera calibration with imperfect planar target. In Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, 6–13 November 2011; IEEE: New York, NY, USA, 2011; pp. 1068–1075. [Google Scholar]
  27. Bielecki, J.; Wojcik-Gargula, A.; Wiacek, U.; Scholz, M.; Igielski, A.; Drozdowicz, K.; Woznicka, U. A neutron pinhole camera for PF-24 source: Conceptual design and optimization. Eur. Phys. J. Plus 2015, 130, 145. [Google Scholar] [CrossRef]
  28. Zhang, Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 22, 1330–1334. [Google Scholar] [CrossRef]
  29. Urban, S.; Leitloff, J.; Hinz, S. Improved wide-angle, fisheye and omnidirectional camera calibration. ISPRS J. Photogramm. Remote Sens. 2015, 108, 72–79. [Google Scholar] [CrossRef]
  30. Claus, D.; Fitzgibbon, A.W. A rational function lens distortion model for general cameras. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; IEEE: New York, NY, USA, 2005; Volume 1, pp. 213–219. [Google Scholar]
  31. Genovese, K. Single-image camera calibration with model-free distortion correction. Opt. Lasers Eng. 2024, 181, 108348. [Google Scholar] [CrossRef]
  32. Zhu, H.; Li, Y.; Liu, X.; Yin, X.; Shao, Y.; Qian, Y.; Tan, J. Camera calibration from very few images based on soft constraint optimization. J. Frankl. Inst. 2020, 357, 2561–2584. [Google Scholar] [CrossRef]
  33. Bergamasco, F.; Cosmo, L.; Gasparetto, A.; Albarelli, A.; Torsello, A. Parameter-free lens distortion calibration of central cameras. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 3847–3855. [Google Scholar]
  34. Lopez, M.; Mari, R.; Gargallo, P.; Kuang, Y.; Gonzalez-Jimenez, J.; Haro, G. Deep single image camera calibration with radial distortion. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: New York, NY, USA, 2019; pp. 11817–11825. [Google Scholar]
  35. Lin, J.; Zhao, M.; Yin, G.; Zhou, H.; Hudoyberdi, T.; Jiang, B. A method for depth camera calibration based on motion capture system. In Proceedings of the 2023 IEEE International Conference on the Cognitive Computing and Complex Data (ICCD), Huaian, China, 21–22 October 2023; IEEE: New York, NY, USA, 2023; pp. 242–246. [Google Scholar]
  36. Ramalingam, S.; Sturm, P. A unifying model for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1309–1319. [Google Scholar] [CrossRef] [PubMed]
  37. Rong, J.; Huang, S.; Shang, Z.; Ying, X. Radial lens distortion correction using convolutional neural networks trained with synthesized images. In Proceedings of the Computer Vision—ACCV 2016, Taipei, Taiwan, 20–24 November 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 35–49. [Google Scholar]
  38. Zhao, K.; Liao, K.; Lin, C.; Liu, M.; Zhao, Y. Joint distortion rectification and super-resolution for self-driving scene perception. Neurocomputing 2021, 435, 176–185. [Google Scholar] [CrossRef]
  39. Liao, K.; Lin, C.; Zhao, Y. A deep ordinal distortion estimation approach for distortion rectification. IEEE Trans. Image Process. 2021, 30, 3362–3375. [Google Scholar] [CrossRef]
  40. Ray, L.S.S.; Zhou, B.; Krupp, L.; Suh, S.; Lukowicz, P. SynthCal: A synthetic benchmarking pipeline to compare camera calibration algorithms. In Proceedings of the Pattern Recognition: 27th International Conference, ICPR 2024, Kolkata, India, 1–5 December 2024; ACM: New York, NY, USA, 2023. [Google Scholar]
  41. Yang, L.; Wang, B.; Zhang, R.; Zhou, H.; Wang, R. Analysis on location accuracy for the binocular stereo vision system. IEEE Photonics J. 2017, 10, 7800316. [Google Scholar] [CrossRef]
  42. Zhou, Y.; Chen, D.; Wu, J.; Huang, M.; Weng, Y. Calibration of RGB-D camera using depth correction model. J. Phys. Conf. Ser. 2022, 2203, 012032. [Google Scholar] [CrossRef]
  43. Basso, F.; Menegatti, E.; Pretto, A. Robust intrinsic and extrinsic calibration of RGB-D cameras. IEEE Trans. Robot. 2018, 34, 1315–1332. [Google Scholar] [CrossRef]
  44. Ramírez-Hernández, L.R.; Rodríguez-Quinoñez, J.C.; Castro-Toscano, M.J.; Hernández-Balbuena, D.; Flores-Fuentes, W.; Rascón-Carmona, R.; Lindner, L.; Sergiyenko, O. Improve three-dimensional point localization accuracy in stereo vision systems using a novel camera calibration method. Int. J. Adv. Robot. Syst. 2020, 17, 1729881419896717. [Google Scholar] [CrossRef]
  45. Liu, S.; Zhang, X.; Xu, L.; Ding, F. Expectation–maximization algorithm for bilinear systems using the Rauch–Tung–Striebel smoother. Automatica 2022, 142, 110365. [Google Scholar] [CrossRef]
  46. Abdel-Aziz, Y.I.; Karara, H.M. Direct linear transformation from comparator coordinates into object space coordinates in close-range photogrammetry. Photogramm. Eng. Remote Sens. 2015, 81, 103–107. [Google Scholar] [CrossRef]
  47. Zheng, Y.; Peng, S. A practical roadside camera calibration method based on least squares optimization. IEEE Trans. Intell. Transp. Syst. 2013, 15, 831–843. [Google Scholar] [CrossRef]
  48. Tian, S.X.; Lu, S.; Liu, Z.M. Levenberg–Marquardt algorithm based nonlinear optimization of camera calibration for relative measurement. In Proceedings of the 2015 34th Chinese Control Conference (CCC), Hangzhou, China, 28–30 July 2015; IEEE: New York, NY, USA, 2015; pp. 4868–4872. [Google Scholar]
  49. Le, T.M.; Fatahi, B.; Khabbaz, H.; Sun, W. Numerical optimization applying trust-region reflective least squares algorithm with constraints to optimize the non-linear creep parameters of soft soil. Appl. Math. Model. 2017, 41, 236–256. [Google Scholar] [CrossRef]
  50. Triggs, B.; McLauchlan, P.F.; Hartley, R.I.; Fitzgibbon, A.W. Bundle Adjustment—A Modern Synthesis. In Proceedings of the Vision Algorithms: Theory and Practice, International Workshop on Vision Algorithms, Corfu, Greece, 21–22 September 1999; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 1999; Volume 1883, pp. 298–372. [Google Scholar]
  51. Yuhai, O.; Cho, Y.; Choi, A.; Mun, J.H. Enhanced Three-Axis Frame and Wand-Based Multi-Camera Calibration Method Using Adaptive Iteratively Reweighted Least Squares and Comprehensive Error Integration. Photonics 2024, 11, 867. [Google Scholar] [CrossRef]
  52. Cooper, M.A.; Raquet, J.F.; Patton, R. Range information characterization of the hokuyo ust-20lx lidar sensor. Photonics 2018, 5, 12. [Google Scholar] [CrossRef]
Figure 1. Intel RealSense D455.
Figure 2. Orbbec Femto Bolt.
Figure 3. Structure sketch map of motion capture system.
Figure 4. Schematic diagram of position points in the dataset for Cam 1: (a) training set; (b) testing set.
Figure 5. Instance images from the dataset. The color coding represents the relative depth, with warmer colors (red) indicating smaller depth values and cooler colors (blue) indicating larger depth values.
Figure 6. The overall workflow of our method. The EM algorithm calibrates the RGB lens using the ground-truth coordinates provided by motion-capture markers and recovers the true depth information. Subsequently, the camera depth measurements are calibrated via a least-squares approach.
Figure 7. Calibration results of different methods on the training and test set.
Figure 8. Relative pixel points by different methods in the training set.
Figure 9. Comparison of iterative processes between random initialization (blue) and initialization based on prior information (red).
Figure 10. The ground truth (green), original depth (red), and calibrated depth (blue) of the position points in the training set.
Figure 11. Comparison of the errors of calibrated and original values in the test set.
Figure 12. Comparison of depth maps for Cam 1 and Cam 2: (a) RGB image, (b) original depth map, and (c) calibrated depth map.
Figure 13. Comparison of local details in depth maps before and after calibration in Cam 1.
Table 1. Common intrinsic parameters of the two cameras.
Parameter          Intel RealSense D455 (Cam 1)   Orbbec Femto Bolt (Cam 2)
f_x / f_y          387.3 / 388.7                  1121.29 / 1120.45
c_x / c_y          321.2 / 243.7                  950.894 / 547.579
Width / Height     640 / 480                      1920 / 1080
k_1 / k_2 / k_3    0.0063 / −0.0040 / 0.0028      0.0790 / −0.0111 / 0.0480
Depth range        0.6–6 m                        0.25–5.46 m
Table 2. Algorithm for the expectation step in intrinsic parameter estimation.
Algorithm: Expectation Step for Intrinsic Parameter Estimation
Input: 3D points P_ci = (x_ci, y_ci, z_ci); 2D pixel points P_pi = (x_pi, y_pi); distortion parameters k_1, k_2, k_3, k_4; image width w and height h.
Output: Intrinsic parameters θ = (θ_x, θ_y).
Step 1: Initialization. The intrinsic parameters are initialized as
  θ_x = (1/n) Σ_{i=1}^{n} (x_pi − w/2) · z_ci / x_ci,
  θ_y = (1/n) Σ_{i=1}^{n} (y_pi − h/2) · z_ci / y_ci.
Step 2: Radial distance computation. For each correspondence i = 1, …, n, the radial distance is computed as
  r_i = sqrt( (x_pi − w/2)² + (y_pi − h/2)² ).
Step 3: Parameter update. The intrinsic parameters are updated according to
  θ_x = (1/n) Σ_{i=1}^{n} ( x_pi / (k_1 + k_3 r_i²) − w/2 ) · z_ci / x_ci,
  θ_y = (1/n) Σ_{i=1}^{n} ( y_pi / (k_2 + k_4 r_i²) − h/2 ) · z_ci / y_ci.
Return: θ_x, θ_y.
Table 3. Algorithm for the maximization step in distortion parameter estimation.
Algorithm: Maximization Step for Distortion Parameter Estimation
Input: 3D points P_ci = (x_ci, y_ci, z_ci); 2D pixel points P_pi = (x_pi, y_pi); intrinsic parameters θ = (θ_x, θ_y); image width w and height h.
Output: Distortion parameters k_x and k_y.
Step 1: Radial distance computation. For each correspondence i = 1, …, n, the radial distance is computed as
  r_i = sqrt( (x_pi − w/2)² + (y_pi − h/2)² ).
Step 2: Matrix construction. The column vectors A_x and A_y are constructed as
  A_x = (x_c1 r_1, x_c2 r_2, …, x_cn r_n)ᵀ,  A_y = (y_c1 r_1, y_c2 r_2, …, y_cn r_n)ᵀ.
Step 3: Distortion parameter estimation. With p_x and p_y denoting the vectors of pixel coordinates x_pi and y_pi, the distortion parameters are estimated (via the pseudoinverses of A_x and A_y) as
  k_x = A_x⁻¹ ( p_x − w/2 − θ_x w ),  k_y = A_y⁻¹ ( p_y − h/2 − θ_y h ).
Return: k_x, k_y.
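The alternation between the E-step (Table 2) and the M-step (Table 3) can be sketched with a deliberately simplified model: one image axis and a single radial coefficient, whereas the paper's steps use four distortion coefficients and both axes. All names and the specific model below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

# Schematic EM loop, assuming the simplified imaging model
#   x_p = w/2 + theta * u * (1 + k * r^2),   u = x_c / z_c,
# where theta is the intrinsic (focal-like) parameter and k the radial term.

def em_calibrate(x_p, u, w, iters=50):
    r2 = (x_p - w / 2) ** 2                  # squared radial distance
    theta = np.mean((x_p - w / 2) / u)       # prior-based initialization
    k = 0.0
    for _ in range(iters):
        # E-step: re-estimate the intrinsic parameter given the distortion.
        theta = np.mean((x_p - w / 2) / (u * (1.0 + k * r2)))
        # M-step: least-squares estimate of the radial coefficient.
        a = theta * u * r2
        b = x_p - w / 2 - theta * u
        k = float(a @ b / (a @ a))
    return float(theta), k

# Synthetic check: data generated exactly from known parameters.
rng = np.random.default_rng(1)
w, theta_true, k_true = 640.0, 400.0, 1e-7
x_p = rng.uniform(50.0, 600.0, size=200)
r2 = (x_p - w / 2) ** 2
u = (x_p - w / 2) / (theta_true * (1.0 + k_true * r2))
theta_est, k_est = em_calibrate(x_p, u, w)
```

On this noise-free synthetic data the alternation converges back to the generating parameters, mirroring the fixed-point behavior of the E/M alternation described in the paper.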
Table 4. Pixel error statistics for Cam 1 and Cam 2 under different methods.
Method   Cam 1 (pixel)            Cam 2 (pixel)
         <2   <4   <6   >6        <2   <4   <6   >6
DLT      12   26   27    0         3   15   22    5
LS        8   23   26    1         1   13   20    7
LM        4   21   24    3         0   12   16   11
TRF       9   22   26    1         1   13   17   10
EM       24   27   27    0         6   22   26    1
Table 5. Comparison of RMSE and speed among various methods on data from two cameras.
Method      RMSE (pixel)         Speed (s)
            Cam 1    Cam 2       Cam 1    Cam 2
DLT         1.26     4.02        0.001    0.001
LS          1.47     4.73        1.494    0.083
LM          1.70     5.02        1.802    1.871
TRF         1.50     4.72        1.011    0.370
EM (Ours)   0.89     3.22        0.007    0.007

Share and Cite

MDPI and ACS Style

Lin, J.; Du, G.; Zhang, Y.; Zhao, Y.; Xie, Q.; Yao, J.; Khadka, A. Expectation–Maximization Method for RGB-D Camera Calibration with Motion Capture System. Photonics 2026, 13, 183. https://doi.org/10.3390/photonics13020183

