1. Introduction
Wireless capsule endoscopy (WCE) was pioneered by Given Imaging in the year 2000 [
1]. It offers numerous advantages over traditional endoscopic procedures: it is less invasive, requires no sedation, and provides a painless, comfortable experience for patients. Using a small swallowable capsule equipped with a miniature camera, WCE visually inspects the entire gastrointestinal (GI) tract, from the esophagus to the large intestine, and is used to diagnose inflammatory bowel disease, GI bleeding, and polyps [
2]. Despite these advantages, WCE images present several challenges, including uneven and low illumination, low resolution, and noise [
3]. Moreover, the lack of control over the capsule’s movement within the GI tract restricts the thorough examination of areas of particular interest.
Three-dimensionally (3D)-reconstructed models of WCE images can be effective for conducting a comprehensive analysis of specific areas of interest. By employing 3D reconstruction algorithms, it becomes feasible to transform the 2D images captured by the capsule camera into a 3D representation of the GI tract. Three-dimensional models along with their images can allow gastroenterologists to visualize internal organs from different angles and perspectives, aiding in the identification of abnormalities and facilitating more precise planning for interventions and surgeries. Results in [
4] have shown that gastroenterologists find 3D models so useful that they sometimes prefer them over the original images.
Within the realm of computer vision, 3D reconstruction poses an intriguing challenge that requires applying a variety of techniques to image data [
5]. Vision-based depth estimation techniques can be classified into different categories. A range of techniques for monocular image-based depth estimation have been developed, including texture gradient analysis [
6], image focus analysis [
7], and photometric methods [
8]. Other approaches leverage multiple images, relying on camera motion or variations in relative camera positions [
9]. The integration of 3D reconstruction techniques finds extensive applications across diverse fields, spanning cultural heritage, robotics, medical diagnostics, video surveillance, and more [
10,
11].
In numerous real-world applications, capturing multiple images of a scene or object from various angles can be challenging. Consequently, single-image-based methods prove effective and suitable in such situations. This is particularly evident in the case of WCE, where the capsule relies on natural peristaltic contractions to traverse the human GI tract. Given its low frame rate, a scene within the GI tract is often captured only once. In such circumstances, single-image-based 3D reconstruction techniques are the only viable option.
Shape from shading (SFS) is a method that requires only one image for 3D reconstruction, and therefore, it is a potential candidate for WCE application. Horn and Brooks [
12] were among the first to recover the 3D shape of the surface using the SFS method. They obtained surface gradients through an iterative approach relying on a nonlinear first-order partial differential equation (PDE), establishing a relationship between 3D shape and intensity variations within an image. By applying integrability constraints, Frankot and Chellappa [
13] demonstrated superior accuracy and efficiency in estimating the depth variable compared with the approach by Horn and Brooks. Kimmel and Sethian [
14] employed a numerical scheme based on the fast marching method to recover depth, yielding a numerically consistent, computationally optimal, and practically fast algorithm for the classical SFS problem. Tankus et al. [
15] remodeled the SFS method under the framework of perspective projection, expanding its range of potential applications. Similarly, Wu et al. [
16] also solved the SFS problem under perspective projection without assuming the light source at the camera center, with a specific focus on medical endoscopy.
The method proposed by Wu et al. [
16] closely aligns with the WCE setting, featuring a near-light model with multiple light sources positioned around the camera center. Consequently, we selected their method as a starting point for further experimentation. The methodology follows a two-step process for shape reconstruction. First, a reflectance function is derived by considering the relative positions of the light sources and camera together with the surface reflectance properties. Then, the error between the reflectance function and the image irradiance is minimized by formulating an image irradiance equation (IIE). While a typical solution to the IIE uses an L2 regularizer as a smoothness constraint, we opted for anisotropic diffusion (AD) due to its superior accuracy compared with the L2 regularizer [
17].
WCE presents considerable challenges in the domain of 3D reconstruction due to its inherent limitations. The device offers no control over its light settings, and its operation requires expensive equipment that is often unavailable. These practical constraints make it difficult to conduct extensive experiments assessing the 3D reconstruction quality of WCE images. To address these challenges, we successfully conducted a comprehensive investigation of the 3D reconstruction of synthetic colon images captured with a camera in a virtual environment [
4]. In the following experiments, we initially employ images of an artificial colon captured under a controlled environment using an industrial endoscope for the purpose of 3D reconstruction, before transitioning to the analysis of images obtained from WCE. The imaging system of the endoscope behaves like that of WCE, though it introduces significantly less lens distortion. Moreover, it offers higher resolution than a typical WCE image, and the light strength can be manually controlled. The endoscope has six rectangular-shaped light-emitting diodes (LEDs) surrounding the camera behind a protective glass covering. The known dimensions of the artificial colon provide a reference for assessing the correctness of the reconstructed 3D colon model.
This article utilizes a single-image-based method to reconstruct the 3D shape of the artificial colon. The camera is corrected for lens distortion, and the light source intensity of the endoscope has also been measured. The camera response function (CRF) is estimated to convert the device’s output grayscale image to image irradiance. The method proposed by Andersen et al. [
18] is employed, which uses a single image of a ColorChecker to compute the CRF of a camera with an unknown imaging pipeline. Wu et al. [
16] assume an ideal multiple-point light model in their perspective SFS (PSFS) approach. Given that the endoscope is equipped with six light sources, it should closely align with the characteristics of the ideal six-point light model. However, the endoscope light sources produce a different pattern due to their rectangular shape and the presence of a glass covering, which can lead to scattering and interference effects. Therefore, corrections are applied to the captured image to account for this deviation. Thereafter, the near-light PSFS algorithm that integrates AD as a smoothness constraint is applied to reconstruct the 3D shapes of the endoscopic images. The PSFS algorithm utilizes grayscale images; therefore, the albedo is simply a reflection factor between 0 and 1. Initially, well-defined primitive objects are tested to assess the method's robustness and accuracy. Afterward, the same methodology is applied to recover the geometry of the colon. The known dimensions of the artificial colon also provide a reference for assessing the correctness of the reconstructed 3D colon model. Finally, we present preliminary results of 3D reconstruction using PillCam images, illustrating the potential applicability of our method across various endoscopic devices. The core contributions of the paper are as follows:
We present a comprehensive pipeline for step-by-step 3D reconstruction using an AD-based PSFS algorithm, as demonstrated in
Figure 1. This pipeline is generic and applicable to any endoscopic device, provided that we have access to the required image data for 3D reconstruction, as well as data for geometric and radiometric calibration.
We utilized JPG images and opted for an endoscope without access to RAW image data, reflecting real-world scenarios in which RAW data are often inaccessible. This choice underscores the practical applicability of our approach.
We validated the AD-based PSFS method in real-world scenarios by conducting 3D reconstruction on simple primitives and comparing the results with ground truth—a practice seldom addressed in the literature. This rigorous validation process enhances the credibility and reliability of our approach.
We present simple methods for estimating the spatial irradiance and light source intensity of the endoscope, designed for scenarios where relying on multiple images for radiometric calibration is not feasible. Further details on these methods are provided in
Section 2.4 of the article.
The rest of the article is organized as follows:
Section 2 provides an overview of various methodologies for 3D reconstruction, encompassing the PSFS model with anisotropic diffusion, geometric and radiometric calibration of the endoscope, albedo measurement, image rescaling, and denoising.
Section 3 details the entire experimental setup, beginning with the creation of ground truth models, followed by image capture, and concluding with the reconstruction of 3D surfaces for primitives and an artificial colon. Additionally, preliminary results for WCE images are presented. Lastly,
Section 4 concludes the article.
2. Methods Overview
This section covers various methods involved in 3D reconstruction using the PSFS method with an output image from an endoscope. We begin by introducing the PSFS method with AD (
Section 2.1). Following that, we discuss the different calibration and preprocessing steps necessary before inputting the image into the PSFS algorithm. Initially, geometric calibration of the endoscope is conducted by capturing images of a checkerboard to correct distortion and determine camera intrinsic parameters, such as focal length (
Section 2.2). Subsequently, the captured endoscopic image intended for 3D reconstruction undergoes radiometric calibration, involving the computation of CRF and spatial irradiance (
Section 2.4). The radiometrically corrected image is then rescaled (
Section 2.5) and denoised (
Section 2.6). The comprehensive pipeline of the 3D reconstruction algorithm using the PSFS method is illustrated in
Figure 1.
2.1. PSFS Model
In this section, we cover the PSFS method, in which six point light sources are placed around the camera, and the camera is directed towards the negative z-axis, as shown in
Figure 2. Under perspective projection, the relationship between the image coordinates $(x, y)$ and the camera coordinates $(X, Y, Z)$ is given as follows:
$$x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z},$$
where $f$ denotes the camera's focal length. Assuming a diffuse surface, the reflected light from a surface point
can be determined using Lambert's cosine law and the inverse-square fall-off law from multiple light sources as follows [
16]:
$$R(x, y) = \sum_{i=1}^{6} \frac{I_0\,\rho}{r_i^{2}} \left(\hat{n} \cdot \hat{L}_i\right),$$
where $I_0$ represents the intensity of the light source(s), $\rho$ denotes the albedo of the surface, and $p$ and $q$ are the surface gradient components along the $x$ and $y$ directions, respectively. Furthermore, $1/r_i^{2}$ accounts for the inverse-square fall-off with distance from each point light source, $\hat{L}_i$ is a unit vector aligned along the $i$-th light ray, and $\hat{n}$ refers to the surface unit normal, which is computed as follows [
12]:
$$\hat{n} = \frac{(-p,\,-q,\,1)}{\sqrt{p^{2}+q^{2}+1}}.$$
Given the distance from the camera center to a light source, we can explicitly write the light source vector from the surface point $(X, Y, Z)$ as follows:
$$\mathbf{L}_i = \mathbf{S}_i - (X, Y, Z), \qquad \mathbf{S}_i = (s_{i,x},\, s_{i,y},\, 0), \quad \lVert \mathbf{S}_i \rVert = d,$$
where $d$ is the distance from the camera center to a light source, for $i = 1, \ldots, 6$. The unit vector $\hat{L}_i$ can be expressed as $\hat{L}_i = \mathbf{L}_i / \lVert \mathbf{L}_i \rVert$.
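To make these ingredients concrete, the following Python sketch evaluates a six-point-light Lambertian reflectance with inverse-square fall-off in the spirit of Equation (2). It is an illustrative sketch rather than the implementation used in this work: the ring radius d_ring, the even angular spacing of the lights, and the positive-z viewing convention are assumptions introduced here.

```python
import numpy as np

def reflectance_six_lights(z, xx, yy, f, I0, albedo, d_ring):
    """Lambertian reflectance from six point lights with inverse-square fall-off.

    Assumptions of this sketch (not taken from the paper): the camera looks
    along the positive z-axis and the six lights are evenly spaced on a ring
    of radius d_ring in the camera plane. xx, yy are image-plane coordinate
    grids, z is the depth map, f is the focal length."""
    # Back-project pixels to 3-D surface points under perspective projection.
    X, Y, Z = xx * z / f, yy * z / f, z
    P = np.stack([X, Y, Z], axis=-1)

    # Surface normal from the depth gradients p = dz/dx, q = dz/dy.
    q, p = np.gradient(z)
    N = np.stack([-p, -q, np.ones_like(z)], axis=-1)
    N /= np.linalg.norm(N, axis=-1, keepdims=True)

    R = np.zeros_like(z)
    for k in range(6):
        ang = 2.0 * np.pi * k / 6.0
        S = np.array([d_ring * np.cos(ang), d_ring * np.sin(ang), 0.0])  # light position
        L = S - P                                        # vector from surface point to light k
        r2 = np.sum(L * L, axis=-1)                      # squared distance to the light
        L_hat = L / np.sqrt(r2)[..., None]
        cos_term = np.clip(np.sum(N * L_hat, axis=-1), 0.0, None)
        R += I0 * albedo * cos_term / r2                 # Lambert's law + inverse-square fall-off
    return R
```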
According to Horn and Brooks [
12], the IIE can be written as follows:
$$E(x, y) = R(x, y),$$
where $E(x, y)$ is the image irradiance. Equation (5) is solved to determine the optimal depth value $z$ by minimizing the difference between $E$ and $R$. The optimization equation is established for $z$, while the values of $p$ and $q$ are updated through the gradients of the modified $z$ [
17]. The error $e$ is minimized as follows:
$$e = e_i + \lambda\, e_s,$$
where $e_i$ and $e_s$ represent the irradiance error and the smoothness constraint, respectively. $\lambda$ is a weighting factor that controls the scaling between $e_i$ and $e_s$. $e_i$ can be computed over the image domain $\Omega$ as follows:
$$e_i = \iint_{\Omega} \left(E(x, y) - R(x, y)\right)^{2} dx\, dy.$$
$e_s$ is typically an L2 regularizer. However, we have employed AD as a smoothness constraint because it not only enhances the accuracy of the depth map by suppressing noise but also preserves the structural details of the reconstructed scene, outperforming the L2 regularizer [
17,
19].
AD is introduced as a smoothness constraint by first calculating a structure tensor ($J$) based on the gradient of the depth $z$ [
20]. $J$ is given as follows [
20]:
$$J = \nabla z\, \nabla z^{T} = \begin{pmatrix} z_x^{2} & z_x z_y \\ z_x z_y & z_y^{2} \end{pmatrix}.$$
Subsequently, we compute the corresponding eigenvalues $\mu_1, \mu_2$ and eigenvectors $v_1, v_2$ of $J$, following a similar approach to [
21]. Utilizing $\mu_i$ and $v_i$, the diffusion tensor $D$ is then derived as follows:
$$D = \sum_{i=1}^{2} \lambda_i\, v_i\, v_i^{T},$$
where the diffusivities $\lambda_i$ are obtained from the eigenvalues $\mu_i$. In terms of $D$, the Lagrangian density $e_s$ can be written as follows [
22]:
$$e_s = \nabla z^{T} D\, \nabla z.$$
Equations (7) and (10) are combined in Equation (6) and can be formulated as follows:
$$e = \iint_{\Omega} \left(E - R\right)^{2} + \lambda\, \nabla z^{T} D\, \nabla z \;\; dx\, dy.$$
The solution to Equation (11) is given by the Euler–Lagrange PDE:
$$\frac{\partial F}{\partial z} - \frac{\partial}{\partial x}\frac{\partial F}{\partial z_{x}} - \frac{\partial}{\partial y}\frac{\partial F}{\partial z_{y}} = 0,$$
where $F$ denotes the integrand of Equation (11), which we numerically solve by gradient descent:
$$z^{(n+1)} = z^{(n)} - \tau \left.\frac{\partial e}{\partial z}\right|_{z = z^{(n)}},$$
where $\tau$ is the step size.
Similar to [
17], $z$ is utilized to derive the structure tensor. Through this single-step computation of the structure tensor, the process becomes efficient, making the computation simpler and closer to linear.
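As an illustration of how the pieces above fit together numerically, the following Python sketch performs one gradient-descent step on the functional e = e_i + λe_s. It is a simplified stand-in for the actual scheme of [17,20,21,22]: the structure tensor is left unsmoothed, the eigenvalue weighting is a hypothetical Perona–Malik-style choice, and the derivative of R with respect to z is approximated by finite differences.

```python
import numpy as np

def structure_tensor(z):
    """Per-pixel structure tensor J = grad(z) grad(z)^T (no Gaussian smoothing)."""
    zy, zx = np.gradient(z)
    J = np.stack([zx * zx, zx * zy, zx * zy, zy * zy], axis=-1)
    return J.reshape(*z.shape, 2, 2)

def diffusion_tensor(J, k=0.1):
    """Diffusion tensor from the eigen-decomposition of J.

    The eigenvalue weighting below is a hypothetical Perona-Malik-style choice,
    not the one prescribed in the paper."""
    mu, v = np.linalg.eigh(J)                       # eigenvalues (ascending) and eigenvectors
    lam = 1.0 / (1.0 + mu / (k * k))                # smooth regions diffuse more than edges
    return np.einsum('...ij,...j,...kj->...ik', v, lam, v)

def ad_smoothness_gradient(z, k=0.1):
    """Variational gradient (up to a constant factor) of e_s = grad(z)^T D grad(z)."""
    zy, zx = np.gradient(z)
    D = diffusion_tensor(structure_tensor(z), k)
    flux_x = D[..., 0, 0] * zx + D[..., 0, 1] * zy  # components of D * grad(z)
    flux_y = D[..., 1, 0] * zx + D[..., 1, 1] * zy
    return -(np.gradient(flux_x, axis=1) + np.gradient(flux_y, axis=0))  # -div(D grad z)

def psfs_step(z, E, reflectance, lam=0.1, tau=1e-2, eps=1e-3):
    """One gradient-descent step on e = e_i + lam * e_s.

    `reflectance(z)` must return R(x, y) for the current depth map (see the
    reflectance sketch above); its derivative w.r.t. z is approximated here by
    central finite differences."""
    R = reflectance(z)
    dR_dz = (reflectance(z + eps) - reflectance(z - eps)) / (2.0 * eps)
    grad_ei = -2.0 * (E - R) * dR_dz                # d/dz of the irradiance error (E - R)^2
    return z - tau * (grad_ei + lam * ad_smoothness_gradient(z))
```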
2.2. Geometric Calibration
Geometric calibration is needed to estimate the camera's intrinsic parameters as well as its lens distortion. The endoscope exhibits only minor lens distortion towards its periphery; nevertheless, this distortion must be corrected for precise depth estimation, since the SFS algorithm assumes a pinhole camera model.
For geometric calibration, we employed a standard checkerboard measuring
cm, with each individual square on the board measuring 4 mm. The images are taken at a 10 cm distance from the tip of the camera at different angles. The MATLAB camera calibration toolbox is used for the geometric calibration of the endoscope [
23]. The intrinsic parameters are computed using Heikkila’s method [
24] with two extra distortion coefficients corresponding to tangential distortion.
The MATLAB camera calibration toolbox requires between 10 and 20 images of the checkerboard from different viewing angles; a total of 15 checkerboard images are used in our case. An image of the checkerboard is shown in
Figure 3a. The camera model is set to standard, and radial distortion is set to two coefficients, as the endoscope camera exhibits little distortion towards the periphery.
Figure 3b shows a sample image of the colon corrected for lens distortion.
It is important to mention here that the procedure is repeated three times with three different sets of checkerboard images to confirm the consistency of the results. The estimated focal length is around mm for all three sets, and no skew is observed.
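For readers without access to the MATLAB toolbox, the following Python/OpenCV sketch performs an equivalent checkerboard calibration with two radial and two tangential distortion coefficients. The folder name, inner-corner layout, and colon image path are hypothetical placeholders, not values from our setup.

```python
import glob
import cv2
import numpy as np

PATTERN = (9, 6)       # hypothetical inner-corner layout; adjust to the actual board
SQUARE_MM = 4.0        # square size from the paper

objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_MM

obj_points, img_points, size = [], [], None
for path in glob.glob("checkerboard/*.jpg"):   # hypothetical image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        continue
    size = gray.shape[::-1]
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (5, 5), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# Two radial + two tangential coefficients (k3 fixed to zero), matching the settings above.
rms, K, dist, _, _ = cv2.calibrateCamera(obj_points, img_points, size, None, None,
                                         flags=cv2.CALIB_FIX_K3)
undistorted = cv2.undistort(cv2.imread("colon.jpg"), K, dist)   # hypothetical colon image
```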
2.3. Albedo Measurement
Albedo is the fraction of incident light that a surface reflects. It has a value between 0 and 1, where 0 corresponds to all the incident light being absorbed by the surface and 1 corresponds to a body that reflects all incident light. The primitives have diffuse white surfaces; therefore, the albedo is assumed to be 1 for all the primitive objects.
The artificial colon consists of a soft rubber material with a nearly uniform pinkish color; therefore, it is necessary to measure the albedo of its surface. The albedo of the colon is measured by imaging the colon and a diffuse Spectralon tile placed side by side. Both the Spectralon and the colon are kept at an equal distance from the camera, and the image is taken outside so that both surfaces receive a uniform distribution of light, as shown in
Figure 4a. The albedo of the surface is measured by taking the ratio between the colon and the Spectralon pixel values at any given location. The estimated albedo value of the artificial colon is
.
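The ratio-based estimate can be illustrated with a minimal sketch; the patch values below are placeholders, not measurements.

```python
import numpy as np

# Linearized grayscale values sampled from the side-by-side image (Figure 4a):
# one patch on the colon surface and one on the Spectralon tile, both under the
# same uniform illumination. The values below are placeholders, not measurements.
colon_patch = np.array([112.0, 115.0, 110.0, 113.0])
spectralon_patch = np.array([158.0, 160.0, 157.0, 159.0])

# Albedo estimate: ratio of the light reflected by the colon to that reflected by
# the (near-ideal) diffuse reflector, averaged over corresponding locations.
albedo = float(np.mean(colon_patch / spectralon_patch))
print(f"estimated albedo: {albedo:.2f}")
```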
2.4. Radiometric Calibration
Radiometric calibration has been performed to measure the light intensity, the CRF, and the spatial distribution of the light intensity across the image. The PSFS algorithm assumes a pinhole model with ideal multiple-point light sources. Therefore, it is crucial to convert the grayscale image to image irradiance via the CRF and to correct for the anisotropy of the light source [
16];
Section 2.4.2 and
Section 2.4.3 provide detailed discussions of the CRF estimation and anisotropy correction, respectively. Measuring the light source intensity is also important, as it is a crucial parameter for computing the reflection function given in Equation (
2).
2.4.1. Light Source Intensity Measurement
The light intensity of the endoscope is measured by using a CS2000 spectroradiometer [
25]. An integrating sphere (IS) must be used to measure the intensity because of the nonisotropic behavior of the light source. The IS is a hollow spherical cavity whose interior is coated with a diffuse white reflective material; its purpose is to provide a stable and uniform illumination condition. The endoscope is placed inside the IS, and the radiant power $P$ is measured over the visible spectrum. After measuring the solid angle $\Omega$ of the endoscope light, the light source intensity $I_0$ is calculated as $I_0 = P / \Omega$. The nonuniformity of the light source, the uniformity of the endoscope light inside the IS, and the spectrum of the light are shown in
Figure 4b–d, respectively.
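The computation itself is a single division; the sketch below uses placeholder values for the measured radiant power and solid angle.

```python
# Placeholder values for the quantities measured with the spectroradiometer and
# the integrating sphere; only the final division reflects the actual computation.
P_watt = 0.012      # radiant power P integrated over the visible spectrum (W)
omega_sr = 1.8      # solid angle of the endoscope light (sr)

I0 = P_watt / omega_sr   # light source intensity used in the reflectance function
print(f"I0 = {I0:.4f} W/sr")
```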
2.4.2. Camera Response Function
The CRF is essential to convert the device's output grayscale image to image irradiance [
16]:
$$E(x, y) = \frac{f^{-1}\!\left(E_g(x, y)\right)}{D(x, y)},$$
where $E$ is the image irradiance, $E_g$ is the grayscale image, and $f$ is the CRF. $D(x, y)$ incorporates the deviation from the ideal point-light source assumed by PSFS.
The endoscope used in this work has an unknown image processing chain, and there is no means of controlling the exposure time. This choice was intentional: it reflects a common limitation of commercially available WCE devices, which generally offer no control over the exposure time. By selecting an endoscope that mimics the behavior of typical WCE devices, our approach demonstrates applicability to a broader range of endoscopic devices.
Experimental observations with the endoscope show that the camera performs automatic exposure adjustments. While capturing images of the SG ColorChecker [
26], we further noted the camera's automatic color adjustment and white balancing mechanisms in operation, similar to the functionality of a standard WCE. These complicating factors compelled us to avoid methods that rely on multiple images for estimating the CRF.
The method by Andersen et al. [
18] is applied to measure the CRF. The method requires only a single image of a ColorChecker to estimate volumetric, spatial, and per-channel nonlinearities. Estimating these nonlinearities compensates for both physical scene and camera properties through a series of successive signal transformations, bridging the gap between the estimated linear and recorded responses. The estimation process relies on a novel principle of additivity, computed using the spectral reflectances of the colored patches on the ColorChecker. The SG ColorChecker [
26] is used to estimate the CRF. An image of the ColorChecker captured by the endoscope and the camera response curves are shown in
Figure 5.
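Once the CRF has been estimated, applying it amounts to a per-pixel lookup. The sketch below uses a placeholder gamma-like curve standing in for the estimated response; it is not the curve produced by the method of Andersen et al.

```python
import numpy as np

def apply_crf(gray_u8, crf_curve):
    """Map an 8-bit grayscale image to relative image irradiance.

    `crf_curve` is a 256-entry lookup table (digital value -> relative
    irradiance) obtained from the single-ColorChecker estimation step; the
    gamma-like curve used below is only a placeholder."""
    lut = np.asarray(crf_curve, dtype=np.float64)
    return lut[gray_u8]

crf_curve = np.linspace(0.0, 1.0, 256) ** 2.2          # placeholder response curve
gray = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
irradiance = apply_crf(gray, crf_curve)
```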
2.4.3. Spatial Irradiance
The reflection model used in the PSFS method is based on six point light sources and demands an ideal six-point light distribution in the image to correctly determine the 3D geometry. The endoscope light deviates from this ideal distribution due to the rectangular shape of the light sources and the scattering and interference effects caused by the glass covering the endoscope lens. We have also detected an inclination of the light sources and, because the six light sources are not centered on the optical axis, the maximum intensity does not align precisely with the image center. Therefore, it is important to quantify these additional effects and compensate for them so that the resulting reflection model satisfies the conditions of the six-point light model. According to [
16],
$$E_{\text{white}}(x, y) = D(x, y)\, \sum_{i=1}^{6} \frac{\hat{n} \cdot \hat{L}_i}{r_i^{2}},$$
where the second term on the right side of Equation (15) represents the light distribution from the six-point light sources.
An image of a white diffuse paper, considered as $E_{\text{white}}$ in our context, was captured and is displayed in
Figure 6a. It is noticeable from the image that the endoscopic lighting deviates from the ideal six-point light configuration, exhibiting an oval pattern offset from the image center. The ideal six-point light distribution model is constructed by physically measuring the distance from the diffuse paper to the tip of the endoscope, as shown in
Figure 6b. Finally, $D(x, y)$ is recovered using Equation (
15) and then compensated for in the image. $D(x, y)$ is shown in
Figure 6c.
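A minimal sketch of the compensation step is shown below, under the assumption (consistent with the multiplicative form of Equation (15) above) that the deviation is the ratio between the measured white-paper image and the ideal six-point-light prediction and is divided out of the input image.

```python
import numpy as np

def spatial_correction(irradiance, white_measured, white_ideal, eps=1e-6):
    """Compensate for the deviation from the ideal six-point-light distribution.

    `white_measured` is the (linearized) image of the diffuse white paper and
    `white_ideal` is the six-point-light distribution predicted for the same
    plane from the measured paper-to-endoscope distance. Their ratio plays the
    role of the deviation term D(x, y), which is divided out of the input."""
    deviation = white_measured / np.maximum(white_ideal, eps)
    return irradiance / np.maximum(deviation, eps)
```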
2.5. Unit Conversion
The parameters computed thus far are in physical units, leading to the estimation of $R$ in physical coordinates. To establish consistency between $E$ and $R$, as outlined in Equation (13), $E$ is transformed from pixel units to physical units. This conversion is achieved as follows:
where $E_{\text{phys}}$ denotes the physical value of the image irradiance, $\theta$ is the angle between the surface normal and the light ray at the point on the surface where illumination is maximized, and $r$ is the distance from the light source to the point on the surface where illumination is maximized. In the case of the primitives, these quantities are measured directly, whereas, in the case of the colon, the estimation of the parameters $r$ and $\theta$ relies on factors such as the field of view (FOV) of the camera, the total length of the colon, and the position of the camera within the colon.
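The sketch below illustrates one plausible reading of this rescaling, in which the pixel of maximal illumination is anchored to the irradiance predicted by the point-light model; the exact expression used in the paper may differ, so this is an assumption for illustration only.

```python
import numpy as np

def to_physical_units(E_g, I0, albedo, r, theta):
    """Rescale relative image irradiance to physical units (illustrative only).

    The pixel of maximal illumination is anchored to the irradiance predicted
    by the point-light model, I0 * albedo * cos(theta) / r**2, and the rest of
    the image is scaled by the same factor."""
    predicted_peak = I0 * albedo * np.cos(theta) / r ** 2
    return E_g * (predicted_peak / E_g.max())
```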
2.6. Image Denoising
In endoscope images, significant noise is observed, mainly due to JPEG compression artifacts such as blocky patterns and color distortions. A noisy image fed into an SFS algorithm can destabilize the differential equations, because inaccuracies and ambiguities in the shading information lead to errors in the estimated surface normals and object shape.
In order to reduce the noise, the method by Xu et al. is utilized [
27]. The method separates the visual information related to an object's surface texture from its underlying structural components within an image; we employ it to remove noise while retaining structural details. It is based on the relative total variation scheme, which captures the essential difference between texture and structure by exploiting their different properties. An optimization method then leverages these variation measures, namely inherent variation and relative total variation, to identify significant structures while disregarding the underlying texture patterns.
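The relative total variation method of Xu et al. [27] is what is used in this work; as an illustrative stand-in for where this denoising step sits in the pipeline, the sketch below applies a standard total-variation denoiser from scikit-image, which is a different, simpler algorithm.

```python
import numpy as np
from skimage.restoration import denoise_tv_chambolle

def denoise_for_sfs(irradiance, weight=0.05):
    """Stand-in denoising step: plain total-variation smoothing of the
    irradiance image. It reduces JPEG block artifacts while keeping large-scale
    structure, but it is NOT the relative-total-variation method of Xu et al."""
    return denoise_tv_chambolle(np.asarray(irradiance, dtype=np.float64), weight=weight)
```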
2.7. Assessment Criteria
The reconstructed 3D shapes of the different primitives are compared with ground truth models by measuring the relative root-mean-square error ($E_{\text{rms}}$) and the relative max-depth error ($E_{\text{maxd}}$). These metrics are chosen to quantify depth errors with respect to a reference depth, making the results easily interpretable. $E_{\text{rms}}$ quantifies the overall geometric deformation present in the reconstructed 3D model, while $E_{\text{maxd}}$ highlights the maximum deviation observed between the 3D-reconstructed model and the ground truth.
$E_{\text{rms}}$ allows for the evaluation of geometric distortion in the 3D-reconstructed model. A perfect 3D reconstruction is indicated by an error value of 0, whereas a highly distorted 3D reconstruction corresponds to a value of 1. $E_{\text{rms}}$ is computed as follows:
$$E_{\text{rms}} = \frac{1}{D_{\max}} \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(D_i - \hat{D}_i\right)^{2}},$$
where $D$, $D_{\max}$, $\hat{D}$, and $n$ represent the ground truth depth, the maximum ground truth depth, the depth of the recovered 3D shape, and the total number of depth points considered for error estimation, respectively.
$E_{\text{maxd}}$ indicates the relative maximum deviation between the estimated depth values produced by a 3D reconstruction algorithm and the ground truth depth values. A low $E_{\text{maxd}}$ suggests that the majority of depth estimates are close to their ground truth counterparts, indicating high accuracy in the 3D reconstruction. Conversely, a high $E_{\text{maxd}}$ implies significant discrepancies between the estimated and ground truth depth values, indicating poorer accuracy in some regions of the reconstruction. $E_{\text{maxd}}$ is computed as follows:
$$E_{\text{maxd}} = \frac{\max_{i} \left| D_i - \hat{D}_i \right|}{D_{\max}}.$$
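Both metrics are straightforward to compute from paired depth maps; the sketch below follows the definitions given above, with the normalization by the maximum ground-truth depth being the interpretation adopted here.

```python
import numpy as np

def relative_errors(D_true, D_est):
    """Relative RMS and max-depth errors against the ground truth depth map.

    Both metrics are normalized by the maximum ground-truth depth, so 0 means
    a perfect reconstruction; this normalization is the interpretation adopted
    in this sketch."""
    diff = np.asarray(D_true, dtype=np.float64) - np.asarray(D_est, dtype=np.float64)
    d_max = np.max(np.abs(D_true))
    e_rms = np.sqrt(np.mean(diff ** 2)) / d_max
    e_maxd = np.max(np.abs(diff)) / d_max
    return e_rms, e_maxd
```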