1. Introduction
Endoscopy is a medical procedure that enables doctors to evaluate and visualize the interior of the body cavities of the gastrointestinal (GI) tract. New technologies such as wireless capsule endoscopy (WCE), which causes the patient very little discomfort, have recently emerged [1]. However, existing capsule endoscopes do not allow control over movement or orientation; instead, they move passively, driven only by intestinal peristalsis [2,3]. Accurate and reliable determination of the location of the capsule endoscope (CE) in the gastrointestinal system both increases diagnostic accuracy and directly affects the effectiveness of treatment. Localization also supports monitoring of the capsule's progression, transit-time analysis, and evaluation of intestinal motility disorders. In this context, studies on CE localization play a critical role in modern medicine in terms of both diagnosis and treatment.
Currently, the position of the capsule is determined by wearable sensors and radio-frequency (RF) triangulation within the 3D abdomen, and various alternative sensor-based methods have also been proposed [4,5]. However, these methods can only determine the location of the CE within the abdominal cavity in 2D or 3D and cannot provide information about its position within the GI tract. Vision-based approaches to intraluminal localization, on the other hand, are still at an early stage but have shown promising results [6,7]. Recent advances in artificial intelligence (AI) increasingly contribute to CE analysis and have demonstrated high accuracy and reliability in medical image analysis [8]. While traditional geometry-based methods such as Visual SLAM (V-SLAM) require manually designed features to match images, DL-based localization, mapping, and reconstruction methods can perform end-to-end feature learning over large datasets [3,9]. In this context, techniques such as DL-based SLAM [10,11], Structure from Motion (SfM) [12,13], and neural field rendering [14,15] offer significant potential for endoscopic applications in body cavities [3].
In this study, a hybrid DL and optical flow framework is proposed for the six-degree-of-freedom (6-DoF) pose estimation and localization task in CE applications, targeting environments with homogeneous texture, repetitive visual patterns, and limited distinctive features, such as the small intestine. First, pose estimation was performed using Oriented FAST and Rotated BRIEF (ORB) and Scale-Invariant Feature Transform (SIFT)-based feature extraction and classical V-SLAM algorithms; however, these methods did not perform adequately in this complex environment, and the feature matches were insufficient and unstable. Therefore, to overcome the cases where V-SLAM failed, a CNN-based hybrid approach enriched with optical flow (Farneback) was developed and evaluated in a test environment. Within the scope of the study, it was demonstrated that the proposed hybrid model showed superior performance, with lower RMSE and faster inference than traditional DL architectures such as ResNet-50 and NASNetLarge. In this context, the developed framework offers a promising alternative for clinical CE applications where real-time and accurate localization is critical.
This study consists of six sections. The introduction explains the importance of CE localization and the shortcomings of existing methods. The related studies section examines SLAM and AI-based approaches and highlights gaps in the literature. The methodology introduces classical V-SLAM, Farneback optical flow, and DL architectures (ResNet-50, NASNetLarge) and details the proposed hybrid method. In the implementation section, the developed techniques are applied in a test environment, and the results are compared with other methods. In the results and discussion, the models are compared in terms of RMSE, training time, and inference time, and the superior performance of the hybrid model is demonstrated. Finally, the conclusion summarizes the general contribution of the study and provides suggestions for future developments.
2. Related Studies
Related studies can be summarized in two categories with respect to CE localization: classical V-SLAM methods and AI applications. Although there are many literature reviews addressing the revolutionary role of AI in modern endoscopy practice, most of these studies focus only on image recognition and segmentation techniques, such as computer-aided diagnosis (CADx) and lesion detection; DL-based localization, mapping, and 3D reconstruction are rarely addressed [3]. A performance evaluation of SLAM algorithms commonly used in endoscopy is presented in [16], but the discussion is limited to geometry-based SLAM methods only. Similarly, [17] also addresses endoscopic 3D reconstruction only within the framework of traditional methods such as shape from shading (SfS) and SfM.
On the other hand, [10] proposes a CNN-based system that performs 6-DoF pose estimation for CE using only monocular camera images. Tested on a realistic human stomach model, the system achieved high accuracy with 7.1% position and 3.4% orientation error, and it produced robust results even in low-resolution, distorted, and repetitively textured endoscopic environments. Different feature descriptors, such as SIFT, SURF, Local Intensity Order Pattern (LIOP), Maximally Stable Extremal Regions (MSER), and the DL-based PWC-Net, were compared in [18], and the best visual-matching results were obtained with PWC-Net. It has been stated in [19] that AI-based approaches offer significant advantages in real-time motion tracking and that CNN-based feature descriptors give successful results even in poorly textured, inhomogeneously illuminated environments. In addition, the triangle matching algorithm proposed by [20] performed the soft tissue surface tracking task with a depth matching network trained on speckle matching data, showing promise for precise navigation of endoscopic robots. The real-time and reliable 3D pose estimation problem required for the control of active CE robots is addressed in [21], where a transformer-based architecture for position and depth estimation is proposed. The model uses the self-attention mechanism to cope with challenges such as the deformable structure of the gastrointestinal system and specular reflections on non-textured surfaces. This approach, which takes advantage of the success of the transformer architecture in visual tasks, contributes to the development of fully autonomous CE systems for advanced diagnostic and therapeutic procedures by providing a more precise determination of the capsule position.
3. Methodology
3.1. V-SLAM
The main aim of V-SLAM systems is to accurately determine the location of the robot, which is one of the most critical stages of the entire process [22]. This process includes a number of important steps required for the robot to detect its location successfully. A particularly prominent step is feature tracking, which involves tasks such as feature extraction, matching, re-localization, and pose estimation [23]. These steps help align and define frames to ensure the creation of the first key frame from the input data [24]. A key frame consists of image frames that contain the observed feature points together with the camera poses. This structure plays an important role in tracking and positioning and contributes to the elimination of drift errors in the camera poses of the robot [25,26]. The key frame is then transferred to the next stage to create the preliminary map [27,28,29].
3.2. Farneback Optical Flow
In this study, the Farneback dense optical flow algorithm (OpenCV implementation) was used for inter-frame motion estimation. This algorithm calculates dense displacement fields by modeling pixel neighborhoods with a polynomial expansion and a pyramidal scheme. The parameters were not specifically tuned; the default OpenCV values, which are widely used in the literature, were adopted and achieved reliable performance in low-texture environments. A block diagram summarizing the basic processing steps of the algorithm is presented in Figure 1.
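For illustration, the following minimal Python sketch computes dense Farneback flow between two consecutive frames with the default-style OpenCV parameter values commonly reported in the literature; the frame file names are placeholders, not the actual dataset files.

```python
import cv2
import numpy as np

# Two consecutive frames (placeholder file names), converted to grayscale.
prev = cv2.cvtColor(cv2.imread("frame_000.png"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame_001.png"), cv2.COLOR_BGR2GRAY)

# Dense Farneback flow with widely used default-style parameters:
# pyr_scale=0.5, levels=3, winsize=15, iterations=3, poly_n=5, poly_sigma=1.2.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)

# flow has shape (H, W, 2): channel 0 holds horizontal (u) and channel 1
# vertical (v) per-pixel displacements.
mean_dx = float(np.mean(flow[..., 0]))
mean_dy = float(np.mean(flow[..., 1]))
print(f"mean displacement: dx={mean_dx:.3f} px, dy={mean_dy:.3f} px")
```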
The Farneback method treats an image as a two-dimensional signal, in which the value of each pixel is considered a function of a two-dimensional variable. A local coordinate system is defined with each pixel at its center, and this function is approximated by the quadratic polynomial in Equation (1) [30]:
f(x) ≈ x^T A x + b^T x + c, (1)
where A is a symmetric 2 × 2 matrix, b is a 2 × 1 vector, and c is a constant coefficient. These coefficients are estimated using the least squares method, taking into account the neighboring pixels around each pixel.
After the pixel neighborhood is displaced by d, the new function is obtained as in Equation (2):
f2(x) = f1(x − d) = x^T A1 x + (b1 − 2 A1 d)^T x + (d^T A1 d − b1^T d + c1). (2)
Farneback uses the following relationships between the polynomial coefficients to find the motion vector d:
A2 = A1, b2 = b1 − 2 A1 d, c2 = d^T A1 d − b1^T d + c1. (3)–(5)
Based on Equations (3)–(5), the displacement vector is obtained as in Equation (6) [30]:
d = −(1/2) A1⁻¹ (b2 − b1), (6)
where A1 is the symmetric estimate of the local second-order coefficients and (b2 − b1) represents the difference between the linear terms obtained from two consecutive frames.
3.3. Lucas–Kanade Optical Flow
Optical flow, first introduced by Gibson in 1950, is a vector field that expresses the instantaneous motion of objects in a scene on the image plane. The basic idea is to use pixel changes and inter-frame correlation between consecutive frames to find correspondences between the previous frame and the current frame, thereby extracting motion information. Optical flow can be generated by target motion, camera motion, or a combination of both; computational approaches generally fall into three categories: region/feature-based matching, frequency-domain, and gradient-based methods. In principle, a velocity vector is assigned to each pixel, resulting in a motion vector field; points in the image correspond to 3D object points via projection. While the vectors vary smoothly across the scene when there is no moving object, they diverge when there is relative motion between the target and the background, and the target's position is calculated from this difference. Once the position is found, the target is tracked iteratively by observing the vertex/feature relationships in consecutive frames [31].
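For comparison with the dense approach above, a minimal sketch of sparse pyramidal Lucas–Kanade tracking with OpenCV is given below; the feature count, window size, and file names are illustrative assumptions rather than the exact settings used in this study.

```python
import cv2
import numpy as np

# Consecutive grayscale frames (placeholder file names).
prev = cv2.cvtColor(cv2.imread("frame_000.png"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame_001.png"), cv2.COLOR_BGR2GRAY)

# Select corner-like points in the previous frame (Shi-Tomasi detector).
p0 = cv2.goodFeaturesToTrack(prev, maxCorners=200, qualityLevel=0.01, minDistance=7)

# Track the points into the current frame with pyramidal Lucas-Kanade.
p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None,
                                           winSize=(21, 21), maxLevel=3)

# Keep only successfully tracked points and report their mean displacement.
mask = status.ravel() == 1
displacement = (p1[mask] - p0[mask]).reshape(-1, 2).mean(axis=0)
print(f"mean sparse flow: dx={displacement[0]:.3f} px, dy={displacement[1]:.3f} px")
```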
3.4. DL Architectures
3.4.1. ResNet-50 Model
ResNet is one of the most widely used CNN architectures. CNN architectures generally consist of convolution layers, pooling layers, and fully connected layers. In addition, operations such as zero padding, the Rectified Linear Unit (ReLU) activation function, and batch normalization are integrated into these structures in order to increase accuracy. The most important advantage of ResNet is that it alleviates the vanishing gradient problem. This is achieved thanks to the residual block structure, which adds a "shortcut" to each layer; these blocks allow gradients to reach the first layers of the network more effectively, so performance degradation is prevented as the depth increases [32].
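To make the shortcut idea concrete, the following simplified Keras sketch builds a single residual block; it is an illustrative two-convolution block with an identity shortcut, not the exact bottleneck configuration used in ResNet-50.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Simplified residual block: two convolutions plus an identity shortcut."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # The shortcut is added before the final activation, so the block learns a
    # residual F(x) and outputs F(x) + x, easing gradient flow to early layers.
    y = layers.Add()([y, shortcut])
    return layers.ReLU()(y)

inputs = layers.Input(shape=(56, 56, 64))
outputs = residual_block(inputs, filters=64)
block = tf.keras.Model(inputs, outputs)
block.summary()
```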
There are different versions of ResNet: ResNet-50, ResNet-101, and ResNet-152. The basic structure of these architectures starts with the processing of the input image by a convolutional layer. The convolution operation is the systematic application of a filter matrix (kernel) to the image and is defined by Equation (7) [32]. Following deep learning practice, we implement "convolution" as cross-correlation without kernel flipping; for mathematical completeness, the classical convolution is also defined in Equation (7). In Equation (8), f is the input image, h is the filter, and m and n are the row and column indices.
After each convolution layer, the ReLU activation function is usually applied. This function sets negative values to zero and preserves positive ones, allowing the model to learn faster and more effectively. The subsequent pooling layer, especially max pooling, reduces the spatial dimension and lightens the computational burden by taking the maximum value at each filter position, thus obtaining a more concise representation of the image features. Finally, the average pooling layer reduces the image by taking the average value in each region. The fully connected layer that follows makes the resulting feature map ready for classification [32].
3.4.2. Neural Architecture Search Network (NASNet) Model
Google Brain developed the NASNet model in 2018 [33]. The NASNetLarge model, a pretrained model from the NASNet architecture family, eliminates the need to manually design a model for the dataset, thanks to its ability to automatically generate the network structure. This architecture offers the advantage of reducing the number of parameters without compromising accuracy. The basic search method used in the pretrained NASNet model is the NAS framework [34], and this structure consists of CNN [35] and Controller Recurrent Neural Network (CRNN) components. The CRNN evaluates the performance of CNN-based subnetworks (child networks) and optimizes the structure of the model through reinforcement learning (RL) [36,37].
3.5. Hybrid CNN–Optical Flow Approach
This section presents the proposed hybrid framework for monocular 3D pose estimation for CE. The approach combines the strengths of a CNN for pose regression and a geometric optical flow algorithm for short-term motion refinement.
3.5.1. CNN-Based Absolute Pose Regression
The core of our method is a DL model based on the ResNet-50 architecture pretrained on the ImageNet dataset. The final classification layer of ResNet-50 is replaced with a custom regression head comprising fully connected layers, as in Equation (9), that outputs a 6-DoF pose vector (x, y, z, φ, θ, ψ), where (x, y, z) represents the 3D translation of the camera and (φ, θ, ψ) is its rotation in roll, pitch, and yaw, respectively.
The following phases make up the suggested pipeline:
The input images are converted from BGR to RGB.
The images are downsized to 224 by 224 pixels.
The normalization procedure makes use of the ImageNet mean and standard deviation values.
Pose regression predicts an absolute 6-DoF pose.
Images were converted from BGR to RGB to ensure compatibility with the deep learning framework (PyTorch/TensorFlow), which requires RGB input.
Each frame is processed independently, which may lead to temporal drift and inconsistency in sequential predictions, particularly in visually repetitive or homogeneous environments like the small bowel.
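A Keras-style sketch of one possible realization of this pipeline (preprocessing plus a ResNet-50 backbone with a 6-DoF regression head) is given below; the head width and layer choices are illustrative assumptions, not the exact configuration of the trained model.

```python
import cv2
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(bgr_image):
    """BGR -> RGB, resize to 224 x 224, normalize with ImageNet statistics."""
    rgb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)
    rgb = cv2.resize(rgb, (224, 224)).astype(np.float32) / 255.0
    return (rgb - IMAGENET_MEAN) / IMAGENET_STD

def build_pose_regressor():
    """ResNet-50 backbone with a 6-DoF regression head (x, y, z, roll, pitch, yaw)."""
    base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                          input_shape=(224, 224, 3))
    for layer in base.layers[:-30]:      # freeze all but the last 30 layers
        layer.trainable = False
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dense(256, activation="relu")(x)     # head width is an assumption
    pose = layers.Dense(6, activation="linear")(x)  # absolute 6-DoF output
    return tf.keras.Model(base.input, pose)

model = build_pose_regressor()
```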
3.5.2. Optical-Flow-Based Relative Motion Estimation
This study uses the Farneback dense optical flow algorithm to estimate pixel-wise motion between consecutive grayscale image frames. Farneback proposed a classical method for solving the dense optical flow problem that yields the motion vectors and contour information of the object. Farneback optical flow is a gradient-based algorithm that uses a polynomial expansion model to estimate the motion between two frames, enabling dense optical flow computation [38]. Applications of the Farneback method have also yielded practical results; for example, [39] successfully applied this algorithm in laser speckle imaging for autonomous vessel detection [30]. In addition to our empirical observations, the literature also supports the choice of the Farneback algorithm for motion estimation tasks. For example, [40] conducted a comparative analysis of different optical flow algorithms for anomaly detection; their results demonstrated that the Farneback algorithm outperformed the Horn–Schunck and Lucas–Kanade methods in terms of execution time while still providing reliable motion estimation. These findings further confirm the suitability of the Farneback method for applications requiring both robustness and computational efficiency.
The Farneback method computes a dense motion field by approximating neighborhoods of the image with quadratic polynomials, allowing the estimation of displacement vectors for every pixel. The steps include the following:
Grayscale conversion
Dense flow computation
Let Δx and Δy denote the mean flow in the horizontal and vertical directions, respectively. These are used to derive a corrective translation vector Δt_xy, which refines the CNN-predicted translation as in Equation (10), where λ ∈ [0, 1] is a fusion weight.
Equation (10) represents the general formulation of the corrective translation vector, while Equations (15) and (16) provide its component-wise weighted implementation.
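A short sketch of this refinement step is given below, assuming Equation (10) is a convex combination of the CNN-predicted lateral translation and the mean-flow displacement; the function name and the interpretation of λ as the CNN weight are our assumptions.

```python
import cv2
import numpy as np

def refine_lateral_translation(prev_gray, curr_gray, cnn_txy, lam=0.9):
    """Refine the CNN lateral translation (x, y) with mean Farneback flow.

    cnn_txy : array of shape (2,) with the CNN-predicted (x, y) translation
    lam     : fusion weight in [0, 1]; lam = 1 keeps only the CNN prediction
    """
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    delta_txy = np.array([flow[..., 0].mean(), flow[..., 1].mean()])
    # Convex combination of the CNN prediction and the flow-based displacement.
    return lam * np.asarray(cnn_txy) + (1.0 - lam) * delta_txy
```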
3.5.3. Pose Fusion and Final Output
For each frame, the final estimated pose is computed by combining the CNN-predicted absolute pose with the optical-flow-derived relative motion:
The CNN provides the structure of the trajectory, including depth (z) and orientation.
Optical flow improves short-term positional accuracy, especially for lateral movement (x, y).
The final output is a smoothed pose sequence that integrates the strengths of data-driven and geometry-based estimation.
Let each input image at time t be denoted I_t. The goal is to estimate the 6-DoF pose T_t ∈ R^6 defined in Equation (11) as T_t = [x_t, y_t, z_t, φ_t, θ_t, ψ_t]^T, where (x_t, y_t, z_t) are the translation components and (φ_t, θ_t, ψ_t) represent the roll, pitch, and yaw angles, respectively.
In CNN-based pose estimation, given a preprocessed image I_t ∈ R^(224 × 224 × 3), the CNN regressor f_CNN predicts the 6-DoF pose as in Equations (12) and (13), where T denotes the general 6-DoF pose and the estimated value predicted by the CNN is its output.
For optical-flow-based displacement, dense motion fields between consecutive grayscale images I_(t−1) and I_t were computed using Farneback optical flow. The mean displacement in x and y is calculated as in Equation (14):
Δx = (1/N) Σ_i u_i,  Δy = (1/N) Σ_i v_i, (14)
where u_i and v_i are the horizontal and vertical flow vectors at pixel i, and N is the total number of pixels.
A weighted average is applied to the x and y translation components, while the z and rotation components are taken from the CNN to combine the CNN and optical flow results (Equations (15)–(18)).
In our hybrid framework, the weighted fusion is applied only to the lateral translation components (x and y). The depth (z) and all rotational components are taken directly from the CNN predictions, since our dataset labels were defined solely along the longitudinal axis with 0.5 cm increments. The fused lateral components are subsequently mapped to metric units using the scale factor defined in Equations (22) and (23). This design ensures unit consistency while reflecting the constrained geometry of the phantom setup. The full estimated pose vector is then given in Equation (19).
The estimated translations are cumulatively added to obtain the camera trajectory over time, as in Equation (20), with P_0 = 0 as the initial position.
Equation (20) accumulates unit-consistent translations: the depth component is metric from the CNN supervision, while the fused lateral displacements (x, y) are scaled into centimeters according to Equations (22) and (23).
The per-frame estimated translations are thus cumulatively summed to obtain the overall trajectory of the camera, as in Equation (21).
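The per-frame fusion and trajectory accumulation can be sketched as follows; the component-wise rule (weighted x, y; CNN-only z and rotations) follows the description above, while the function names and the pixel-to-centimeter scale value are illustrative placeholders for Equations (15)–(23).

```python
import numpy as np

def fuse_pose(cnn_pose, mean_flow_xy, alpha=0.9, scale_cm_per_px=0.01):
    """Fuse one frame: weighted (x, y); z and rotations taken from the CNN.

    cnn_pose        : (6,) array [x, y, z, roll, pitch, yaw] predicted by the CNN
    mean_flow_xy    : (2,) mean Farneback flow in pixels
    scale_cm_per_px : illustrative pixel-to-centimeter factor (Eqs. (22)-(23))
    """
    fused = np.asarray(cnn_pose, dtype=float).copy()
    flow_cm = np.asarray(mean_flow_xy, dtype=float) * scale_cm_per_px
    fused[0] = alpha * fused[0] + (1 - alpha) * flow_cm[0]  # x component
    fused[1] = alpha * fused[1] + (1 - alpha) * flow_cm[1]  # y component
    return fused

def accumulate_trajectory(fused_poses):
    """Cumulatively sum per-frame translations, starting from P0 = 0."""
    translations = np.asarray(fused_poses)[:, :3]
    return np.vstack([np.zeros(3), np.cumsum(translations, axis=0)])
```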
The workflow of the proposed hybrid ResNet-50–optical flow framework is shown in Figure 2.
4. Implementation
4.1. Circumferential-Scanning Endoscope
We developed a probe comprising a laser diode source, a prism mounted on a micromotor, and an endoscopic camera. The probe housing was formed by wrapping all components in a cylindrical, transparent protective sheath. The probe was tested on a cylindrical phantom with randomly distributed bumps 2–3 mm in height. The probe captured video during a circumferential scanning operation as it was pulled through the phantom using a translational stage. For this purpose, a test area was created in the optoelectronics laboratory in order to obtain controlled distances between the image-acquisition points. The test area dimensions are 9 cm in height and 5 cm in width (Figure 3). The phantom was designed to mimic the morphology and texture of the jejunum segment of the small intestine, which is characterized by relatively homogeneous surface patterns.
In the test environment, the capsule is placed inside a test tube and is traversed along the longitudinal axis using a linear stage. This setup mimics the geometry of the jejunum segment of the small intestine.
Figure 4 depicts the proposed capsule scheme.
4.2. SLAM Application
To benchmark our study, we evaluated SLAM performance on monocular images obtained in our test area. In this context, monocular SLAM using ORB and SIFT features and descriptors was applied separately to the localization process, using a sequence of images to estimate the 3D trajectory of the camera over time and to observe the performance of SLAM in challenging environments such as the small bowel. A total of 19 images were captured during the process at 0.5 cm intervals. The photos used in the localization process are shown in Figure 5.
The first method uses ORB, a fast and efficient feature extractor and matcher that is suitable for real-time or resource-limited applications. For each consecutive image pair, ORB features and descriptors are extracted, and these features are matched using a brute-force matcher with the Hamming distance. The relative camera motion (rotation R and translation t) is then recovered from the essential matrix, and these poses are accumulated to track the camera's estimated position in 3D space. The resulting (x, y, z) positions are stored and plotted to visualize the camera trajectory.
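A condensed sketch of this ORB-based visual odometry step, using standard OpenCV calls (ORB_create, BFMatcher with Hamming distance, findEssentialMat, recoverPose), is shown below; the intrinsic matrix K is a placeholder, not the calibration of our endoscopic camera.

```python
import cv2
import numpy as np

K = np.array([[600.0, 0.0, 320.0],   # placeholder camera intrinsics
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])

orb = cv2.ORB_create(nfeatures=1000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def relative_pose(img1_gray, img2_gray):
    """Estimate the relative camera motion (R, t) between two grayscale frames."""
    kp1, des1 = orb.detectAndCompute(img1_gray, None)
    kp2, des2 = orb.detectAndCompute(img2_gray, None)
    matches = matcher.match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t  # t is defined only up to scale for a monocular camera
```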
The other approach implements a basic monocular visual odometry pipeline using SIFT features and essential-matrix-based pose estimation to track the 3D trajectory of a camera from sequential images. It uses SIFT to extract keypoints and descriptors from consecutive image frames. Features are matched between image pairs using a brute-force matcher, and the relative camera motion (rotation and translation) is estimated via the essential matrix. The detected keypoints and matched features between consecutive images using SIFT and ORB are shown in Figure 6 and Figure 7, respectively.
Therefore, the camera's pose is updated incrementally over time, and the resulting 3D positions are plotted to visualize the camera path. The SLAM localization with SIFT and ORB, respectively, is shown in Figure 8.
In the region marked with a rectangle in Figure 8, it can be observed that the SLAM algorithm cannot produce estimated camera poses. Since this region has low contrast and few visually distinguishing features, the algorithm could not establish the necessary matches and tracking failed. In particular, the failure to detect a sufficient number of reliable visual features between consecutive frames prevented the motion from being modeled correctly and caused the localization system to fail in this area. As a result, the camera position could not be determined in this region, and a gap occurred in the estimated trajectory. This experiment confirmed that SLAM algorithms may perform poorly in environments where visual diversity is low and repetitive patterns or homogeneous textures dominate.
4.3. Hybrid CNN and Optical Flow Application
In this application, a hybrid pose estimation framework that integrates a fine-tuned ResNet-50 CNN with classical optical flow techniques was proposed to improve localization accuracy from monocular endoscopic images and to overcome feature-based SLAM failures. Training of all CNN models was performed in the following hardware and software environment:
CPU: Intel Core i7-8750H
GPU: NVIDIA GeForce GTX 1060
RAM: 16 GB DDR4
Storage: 1 TB SSD
Python Version: 3.12.1
CUDA Version: 11.3
ResNet-50 was chosen for its proven generalization capability in medical imaging and its robustness against overfitting. This is especially important in areas where labeled data are limited and visual similarity between classes is high, such as gastrointestinal tract images. The model architecture is based on the ResNet-50 backbone pretrained on ImageNet, in which the last 30 layers are left unfrozen for fine-tuning. A regression head consisting of global average pooling and two dense layers predicts 6-DoF camera poses (x, y, z, roll, pitch, yaw) from preprocessed 224 × 224 RGB images.
A partial fine-tuning strategy was adopted for network training. In this strategy, the last 30 layers of the network, the upper-level residual blocks, and the regression head were left trainable, while the earlier layers (early-stage convolution layers) were frozen. In this way, the low-level features learned on the ImageNet dataset (edges, corners, textures) were preserved, while the higher-level layers were adapted to learn patterns specific to endoscopic images. The following precautions were taken to train the model stably and prevent overfitting:
Early stopping according to the validation dataset;
Data augmentation techniques, including random brightness variation and rotation, were applied to simulate variability.
Data augmentation on the training data is used to increase the overall accuracy of the model. Various manipulations are applied to the images, including rotation, zoom, shift, brightness adjustment, and horizontal flip. These operations allow the model to be trained on a wider range of data and increase its generalization ability. The dataset is labeled synthetically, assuming continuous forward motion of the camera along the z-axis (0.5 cm per frame), with all other pose components initialized to zero. The model is trained using the Adam optimizer with early stopping based on the validation loss.
During inference, the poses estimated by the CNN are combined with displacements derived from Farneback dense optical flow computed between consecutive image pairs. To improve short-term motion estimation and reduce frame-to-frame drift, dense optical flow based on the Farneback method was calculated between consecutive grayscale image frames. Unlike sparse feature tracking, this approach estimates pixel-wise displacement across the entire image. The average values of the horizontal and vertical flow vectors (Δx, Δy) were calculated to estimate planar motion. The weighted fusion combines the predicted positions by balancing the contributions of the CNN and optical flow with the parameter α, which was set to 0.9; in other words, 90% of the weight was given to the CNN predictions and 10% to optical flow. The α value thus balances the contributions of the two components and was used to optimize the accuracy of the fused pose estimates.
During the study, the effect of the α value was investigated through different experiments. Initially, an equally weighted value of α = 0.5 was used; comparative experiments with α values of 0.1, 0.3, 0.5, 0.7, and 0.9 were then conducted. The results revealed that α = 0.9 consistently achieved the lowest RMSE (0.03 cm), while smaller α values increased the error due to reduced reliance on the CNN predictions. Thus, α = 0.9 was selected as the optimal balance between CNN-based long-term stability and optical-flow-based short-term refinement. The main reason for this choice is that the optical-flow-based approach provides more accurate estimates for short-term movements, while the CNN model produces more reliable results for long-term estimates; α = 0.9 therefore exploits the sensitivity of optical flow to short-term motion while preserving the stronger long-term prediction power of the CNN, minimizing the total error.
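The α sweep can be reproduced with a simple scoring loop such as the sketch below; the helper names and input arrays are hypothetical, and the RMSE routine mirrors the evaluation described in Section 5.

```python
import numpy as np

def rmse(pred, gt):
    """Root-mean-square error between predicted and ground-truth positions."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(gt)) ** 2)))

def sweep_alpha(cnn_xy, flow_xy, gt_xy, alphas=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Score each candidate CNN weight; the smallest RMSE selects alpha."""
    scores = {}
    for a in alphas:
        fused = a * np.asarray(cnn_xy) + (1 - a) * np.asarray(flow_xy)
        scores[a] = rmse(fused, gt_xy)
    return scores
```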
The accumulated 3D trajectory is visualized and plotted to evaluate the effectiveness of the hybrid strategy in capturing the spatial progression of the camera; the improvement in monocular pose recovery demonstrates the benefit of combining DL with classical motion estimation.
This hybrid framework combines the structural consistency of CNN-based regression with the short-term accuracy of dense geometric motion estimation. It is particularly suitable for environments where texture is limited and motion is complex, e.g., minimally invasive diagnostics, robotic navigation, or endoscopy. The results show that combining ResNet-50-based deep regression with dense optical flow improves robustness and accuracy in monocular visual localization tasks.
Figure 9 demonstrates the visualization of the estimated trajectory using the CNN and optical flow hybrid pipeline.
Figure 10 demonstrates the training and validation loss and Mean Absolute Error (MAE). This graph visualizes the performance of the model by comparing the MSE (Mean Squared Error) loss and MAE values on the training and validation data during the training process. In the first epochs, the training loss and validation loss are high, while both losses decrease rapidly as training progresses.
This shows that the model begins to learn from the training data and to make correct predictions on the validation data. Similarly, a significant decrease is observed in the training and validation MAE values, which indicates that the error in the model's predictions decreases, i.e., the accuracy increases. In the later stages of training, the validation loss and validation MAE begin to stabilize, showing that the model does not overfit and generalizes well. As a result, the model reaches its best validation performance thanks to early stopping, while minimizing the risk of overfitting during training. The displacement over frames obtained with optical flow is shown in Figure 11.
To demonstrate the performance of the Farneback optical flow used in the proposed method, the same strategy was applied with Lucas–Kanade optical flow. The localization result of ResNet-50 with Lucas–Kanade optical flow can be seen in Figure 12.
Figure 13 demonstrates the train and validation loss and Mean Absolute Error (MAE).
4.4. ResNet-50 and NASNetLarge Application
To evaluate the effectiveness of our proposed ResNet-50–optical-flow-based hybrid localization framework in the small bowel, the results were compared with two baseline DL architectures trained end to end: ResNet-50 and NASNetLarge.
Initially, 6-DoF pose estimation was performed using a model built on a ResNet-50 architecture pretrained on ImageNet, with the last 30 layers and the regression head fine-tuned. Endoscopic images were resized, normalized, and sequentially assembled into the training dataset to fit the model. Training is performed on the last 30 layers of the model and the regression head (output layer). The image size is set to 224 × 224, and the images are processed in RGB format. Transfer learning is applied using weights trained on ImageNet. The weights of the model are updated using the Adam optimizer (learning rate 1 × 10⁻⁴). MSE and MAE are used as the loss function and the performance metric, respectively. During training, data augmentation is applied by adding transformations (rotation, zoom, shift, brightness change, and horizontal flip) to each image in the dataset. This process allows the model to learn from more data and reduces the risk of overfitting. The images are divided into 80% training and 20% validation with a train–validation split. The batch size used for training is 8, meaning that learning is performed on eight images at a time. In addition, early stopping is used to prevent the model from overfitting; if the validation loss does not improve for five epochs, training is stopped. A total of 20 epochs are used during training, but early stopping can shorten the training time. The best model weights obtained during training, i.e., those with the lowest validation loss, are saved. Pseudo-pose data corresponding to each image were created with 0.5 cm increments along the Z axis only. Finally, the estimated positions were visualized in a three-dimensional graph, and the movement path followed by the model over time was drawn. The localization result of the fine-tuned ResNet-50 is shown in Figure 14.
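A compact Keras sketch of this training setup (transfer learning, Adam with a 1 × 10⁻⁴ learning rate, MSE loss with an MAE metric, batch size 8, 20 epochs, early stopping with patience 5, and an 80/20 split) is given below; the image and label arrays are placeholders for the augmented dataset.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Placeholder arrays standing in for the augmented images and their synthetic
# labels (z increasing by 0.5 cm per original frame, all other axes zero).
images = np.zeros((190, 224, 224, 3), dtype=np.float32)
poses = np.zeros((190, 6), dtype=np.float32)

# ResNet-50 backbone with the last 30 layers trainable and a regression head.
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3))
for layer in base.layers[:-30]:
    layer.trainable = False
x = layers.GlobalAveragePooling2D()(base.output)
out = layers.Dense(6)(x)
model = tf.keras.Model(base.input, out)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="mse", metrics=["mae"])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)

split = int(0.8 * len(images))  # 80% training / 20% validation
model.fit(images[:split], poses[:split],
          validation_data=(images[split:], poses[split:]),
          batch_size=8, epochs=20, callbacks=[early_stop])
```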
Figure 15 demonstrates the train and validation loss and MAE values of ResNet-50 implementation.
Finally, the NASNetLarge model pretrained on ImageNet was used for 6-DoF pose estimation from monocular endoscopic images. The implementation also includes data augmentation techniques such as image rotation, shift, and translation, thus increasing the training dataset from 19 original test images to 190 augmented images. Training is performed by freezing the initial layers of the NASNetLarge model (to preserve the previously learned features) and training only the last 30 layers together with the newly added regression head. The model is compiled with the Adam optimizer and MSE loss. The consecutive endoscopic images used as input were resized and normalized to fit the model, and pseudo-pose labels created with 0.5 cm increments along the Z axis only were used for each image. The model was trained with a batch size of 8 for 20 epochs, and the total training time was measured and reported. After training, pose estimation was performed for each image; the first position was taken as the starting position, and subsequent estimates were cumulatively summed to create a 3D tracking route. The obtained results were visualized in a 3D environment, and the motion estimation performance of the model was analyzed. The localization result of NASNetLarge is shown in Figure 16.
Figure 17 demonstrates the training and validation loss and MAE values of the NASNetLarge implementation.
The basic structures and training strategies of the four DL-based models used for 6-DoF pose estimation from monocular endoscopic images are shown in Table 1. The first two models are based on a purely CNN-based regression approach, using ResNet-50 and NASNetLarge backbones. The third and fourth models offer a hybrid pose estimation approach by incorporating the motion information between consecutive frames via optical flow in addition to the classical CNN approach.
5. Results and Discussion
In the study, each model was evaluated according to four metrics: localization accuracy (RMSE), rotational accuracy, training time, and inference time per image.
A scale factor was assigned in order to obtain real-world distances using Equations (22) and (23).
The RMSE calculation was carried out based on the real-world measurements between the image-acquisition points, which are 0.5 cm apart. Residuals (V) were obtained by calculating the differences between the values predicted by the model and the ground-truth values, and the RMSE was calculated from these residuals.
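A minimal sketch of this evaluation is given below; the form of the scale factor (ratio of the known 0.5 cm step to the mean predicted step) is our assumption for Equations (22) and (23), and the function name is hypothetical.

```python
import numpy as np

def rmse_against_grid(pred_positions, step_cm=0.5):
    """RMSE against the known 0.5 cm spacing of the image-acquisition points.

    pred_positions : per-frame predicted positions along the scanning axis
    step_cm        : true spacing between consecutive acquisition points
    A global scale factor maps predictions to centimeters (our reading of
    Equations (22) and (23)).
    """
    pred = np.asarray(pred_positions, dtype=float)
    ground_truth = step_cm * np.arange(len(pred))         # 0.0, 0.5, 1.0, ... cm
    scale = step_cm / np.mean(np.diff(pred))              # model units -> cm
    residuals = scale * (pred - pred[0]) - ground_truth   # residuals V
    return float(np.sqrt(np.mean(residuals ** 2)))
```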
The fine-tuned ResNet-50 model performed well in terms of RMSE, but its inference time was long. The NASNetLarge model achieved lower accuracy, and its training and inference times were also very long. The model proposed in this study, the ResNet-50–optical flow hybrid framework, outperformed its alternatives in accuracy and speed: it provided a low RMSE and was reasonably fast. These results show that combining the motion information between consecutive frames obtained with optical flow and the high-level visual features obtained with CNNs significantly improves pose estimation accuracy, especially in monocular camera localization tasks. The RMSE comparison of the pose estimation models is given in Figure 18.
The effect of the CNN weight (α) on RMSE was also evaluated (Figure 19).
As α increases (i.e., as the CNN weight increases), the RMSE decreases monotonically from 1.02 cm to 0.03 cm, indicating that the dominance of the CNN contribution in the hybrid significantly increases accuracy; the best result is obtained at α = 0.9.
Table 2 presents the comparative performances of different CNN-based models on the monocular pose estimation task.
The rotational accuracy was also tested. Images were taken on a flat surface on a fixed platform, and the rotations were assumed to be (0, 0, 0) for all test data. The rotation errors estimated by the ResNet-50 model remained very low, with roll (0.01 radian), pitch (0.02 radian), and yaw (0.05 radian) errors, indicating that the model provides accurate pose estimates. In the ResNet-50 (Fine-Tuned) + Farneback optical flow (hybrid) model, more stable estimations were made, and the rotation errors were further reduced thanks to the integration of optical flow. The roll (0.04 radian), pitch (0.013 radian), and yaw (0.015 radian) errors of this model show that optical flow increases accuracy with visual motion information. On the other hand, the rotation errors are higher in the NASNetLarge model; the roll (0.14 radian), pitch (0.13 radian), and yaw (0.08 radian) errors show that the larger structure of the model causes a loss of accuracy when generalizing. The rotation errors estimated by ResNet-50 (Fine-Tuned) + Lucas–Kanade optical flow were roll (0.09 radian), pitch (0.04 radian), and yaw (0.011 radian). In general, ResNet-50 and ResNet-50 (Fine-Tuned) + Farneback optical flow provided more accurate and reliable predictions in the tests performed on flat ground, showing that optical flow integration is beneficial. The rotational errors of each model are shown in Table 3.
Classical feature-based methods, ORB and SIFT, were also evaluated for comparison. However, both methods suffered from significant tracking losses in the phantom environment. Especially in low-texture regions, the algorithms frequently failed because a sufficient number of reliable matches was not found, thus preventing a continuous trajectory. Therefore, RMSE calculation was not possible. This significant failure of ORB and SIFT demonstrates the limited applicability of classical methods to homogeneous tissues like intestines. In contrast, the proposed hybrid CNN + optical flow approach achieved stable localization throughout the entire sequence without any tracking loss in any frame and demonstrated superior performance with an error value of 0.03 cm. This clear contrast strongly demonstrates that the proposed method is much more reliable and robust than classical methods.
In the study, a new hybrid framework is proposed for 6-DoF pose estimation from monocular images in difficult and narrow areas such as CE. This framework evaluates both visual and temporal information simultaneously by combining a DL-based CNN model (fine-tuned ResNet-50) and the Farneback optical flow algorithm. The developed system aims to make the pose estimation task more accurate and stable by taking advantage of the continuity of motion between frames. The proposed framework was tested comparatively against two different DL architectures. In addition, V-SLAM methods widely used in the existing literature were also applied in the test environment, but it was observed that SLAM algorithms could not provide sufficient accuracy and stability in this particular scenario. Especially in endoscopic images with little texture, repeated patterns, and low distinctiveness, SLAM-based solutions suffered serious performance loss. The proposed system increases both accuracy and stability in pose estimation by combining the deep visual feature extraction power of a fine-tuned ResNet-50 model with the temporal motion sensitivity of the Farneback optical flow algorithm. In the comparative experiments, the RMSE value of 0.39 cm obtained with the ResNet-50 model was reduced to 0.03 cm with the hybrid model. The NASNetLarge model achieved lower success with 1.46 cm RMSE and also fell short in terms of inference time (529.66 ms/frame). In contrast, the proposed hybrid model showed the best performance with 0.03 cm RMSE and 241.84 ms/frame inference time; it provided an improvement of approximately 92.31% in accuracy compared to the CNN-only model. The results show that the proposed CNN + optical flow hybrid approach provides a reliable alternative, especially in challenging environments where SLAM fails. Thanks to its real-time operability and low error rate, this system is a candidate to be used as an effective pose estimation tool in medical imaging systems and CE applications. In our comparison, the Lucas–Kanade method provided competitive accuracy within our hybrid framework (e.g., RMSE = 0.05 cm for α = 0.9). However, a higher deviation was observed compared to the Farneback method (Farneback: 0.03 cm). Based on these findings, Farneback was chosen as the default flow component in this study; nevertheless, Lucas–Kanade remains a viable alternative.
We acknowledge that dense optical flow may produce unreliable vectors in low-texture regions. In our design, optical flow contributes only to short-term lateral refinements (x, y), while the CNN provides stable depth (z) and orientation. Thus, the fusion mitigates the weaknesses of each component. Furthermore, although the CNN is treated as a regressor, it effectively learns gradient- and texture-based cues relevant to intestinal structures. Future work will integrate geometric constraints or attention mechanisms to improve interpretability. From a clinical perspective, the achieved localization accuracy of 0.03 cm is quite promising. Current localization methods used in capsule endoscopy (e.g., magnetic tracking or RF-based techniques) typically provide accuracies on the order of a few millimeters. In contrast, our hybrid CNN + optical flow approach demonstrated submillimeter accuracy in a phantom environment, exceeding the requirements of current clinical standards. Although obtained in a phantom environment, these results demonstrate a proof of concept with submillimeter accuracy, highlighting the potential for improving patient outcomes in capsule endoscopy. Such accuracy could enable more reliable lesion localization, facilitate evaluation of intestinal motility and transit times, and support targeted therapeutic interventions. While the findings were achieved under controlled conditions, they indicate the applicability of the proposed framework and its potential clinical value once validated on larger and more diverse datasets. Future studies will focus on verifying whether this performance can be maintained on real-world patient data and across different tissue conditions of the gastrointestinal tract.
Limitations and Future Work
This study has several limitations. First, the dataset is relatively small, consisting of 19 original images acquired in a phantom environment; these images were expanded to 190 samples using data augmentation methods. While data augmentation increases diversity, a dataset of this size does not fully guarantee robust generalization of the model. Therefore, future studies will focus on collecting datasets that cover a wider range of conditions and include data from real patients. Second, the phantom used in this study was designed primarily to mimic the jejunum segment, which has relatively homogeneous tissue. Therefore, the results obtained may not be directly generalizable to intestinal segments such as the duodenum and ileum, which exhibit different texture and contrast characteristics. Furthermore, the fusion strategy was implemented with a fixed α value (0.9); while this value provided the best performance in our setup, its stability across different anatomical conditions has not yet been verified. Frame-to-frame accumulation inevitably causes drift over long trajectories; in our phantom setup, which involves short and controlled sequences, drift remained bounded. Future studies will incorporate loop closure and depth cues to mitigate long-term divergence and will aim to develop adaptive or dynamic fusion strategies to improve generalizability. This phantom study also does not include bowel turns; calibration strategies for curved segments and patient-specific anatomy will likewise be addressed in future work.
6. Conclusions
This study focuses on the problem of providing accurate localization in challenging environments such as the small bowel, where feature-based SLAM approaches fail. For this purpose, a test area that mimics the human small bowel was created in the Opto-Electronics Laboratory, and a hybrid framework that combines ResNet-50 and optical flow was proposed and compared with different DL-based models and SLAM results.
The experiments show that training the entire ResNet-50 architecture with fine-tuning gives moderate pose estimation accuracy. However, the most successful results are obtained with the proposed hybrid model, which provided the lowest error with an RMSE of 0.03 cm. This demonstrates the effectiveness of augmenting visual-feature-based DL models with motion awareness from optical flow.
The findings show that hybrid localization approaches provide more stable and high-accuracy results, especially in visually challenging conditions such as gastrointestinal images. In future work, the aim is to further increase the localization performance in complex and dynamic environments by enriching this framework with inertial measurement units (IMU) and evaluating transformer-based architectures. The original aspects and scientific contributions of the study are presented below:
A localization framework has been developed to address the failures of V-SLAM algorithms, tested in small-intestine-like conditions.
A hybrid method combining DL-based CNN architecture with optical flow information has been proposed.
This hybrid approach provides more stable and reliable pose estimation in visually homogeneous, low-contrast, and repetitive patterned environments.
The success of the hybrid model was evaluated comparatively with different DL architectures and classical SLAM methods, and it was revealed that the proposed approach was superior in terms of both accuracy and stability.
In areas where traditional methods failed, frame-based error accumulation was reduced by using optical-flow-based motion information, and pose estimation was made more consistent.
The study proposes a reliable and extensible solution for clinical applications such as CE.
Future studies aim to test the method with real patient data, add rotational motion components to the optical flow algorithm, and make the system real-time with hardware acceleration.