A Unified Deep Learning-Based Corridor Following with Image-Based Obstacle Avoidance for Autonomous Wheelchair Navigation

Abdul Hafez, A. H.

doi:10.3390/math14101698


by

A. H. Abdul Hafez

Department of Computer Science, College of Computer Sciences and Information Technology, King Faisal University, Al-Ahsa 31982, Saudi Arabia

Mathematics 2026, 14(10), 1698; https://doi.org/10.3390/math14101698

Submission received: 4 April 2026 / Revised: 5 May 2026 / Accepted: 12 May 2026 / Published: 15 May 2026

(This article belongs to the Topic AI and Data-Driven Advancements in Industry 4.0, 2nd Edition)


Abstract

Autonomous wheelchair navigation requires both reliable global guidance and safe local interaction with the environment, typically addressed using separate perception and control strategies. This paper presents a unified vision-based control framework that combines learning-based corridor following with image-based obstacle avoidance under a common visual servoing perspective. This work provides a unified interpretation of learning-based and analytical control as complementary realizations of visual servoing. A convolutional neural network (CNN) is employed to directly predict steering commands from monocular images, enabling robust corridor following without explicit feature extraction. In parallel, obstacle avoidance is formulated as an image-based visual servoing (IBVS) task, where detected obstacles are represented as image features and regulated toward safe regions. A supervisory control strategy coordinates these components by prioritizing safety-critical avoidance when necessary, while maintaining nominal navigation otherwise. The system is implemented using a single monocular camera and deployed on a low-cost embedded platform. Experimental results demonstrate that the CNN-based module maintains stable performance under challenging visual conditions, while the IBVS controller provides predictable and reliable avoidance behavior. The proposed framework highlights the complementary roles of learning-based and analytical visual servoing, offering a practical and scalable solution for assistive autonomous mobility.

Keywords:

autonomous wheelchair; deep learning; assistive robotics; edge AI; Raspberry Pi; corridor following; obstacle avoidance; visual servoing

MSC:

68T07

1. Introduction

Autonomous assistive mobility has emerged as a critical research direction for enhancing the independence and safety of individuals with limited motor capabilities. Smart wheelchairs, in particular, offer a promising solution for enabling users to navigate safely in indoor and semi-structured environments such as corridors, hospitals, entrances, and sidewalks. However, achieving reliable autonomy in such systems requires the seamless integration of robust perception, safe motion control, and cost-effective hardware deployment, which remains a challenging problem.

Recent surveys highlight the wide range of approaches proposed for autonomous wheelchair navigation [1,2]. A significant portion of this research focuses on indoor corridor following, where image-based visual servoing (VS) has proven effective [3,4,5,6,7]. In contrast, solutions for broader navigation scenarios, including semi-structured and outdoor environments, often rely on complex and expensive sensing and computation pipelines. These include LiDAR-based SLAM systems [8,9], map-centric planning frameworks [10,11], inverse reinforcement learning approaches [12,13], and large-scale vision models [14]. While effective, such systems are difficult to deploy in low-cost assistive platforms.

Vision-based alternatives provide a more accessible direction, but many existing approaches still focus on isolated subtasks or depend on multi-stage perception pipelines. For example, some methods combine semantic segmentation with cost-map fusion [15], while others address specialized problems such as local planning under occlusions [16] or obstacle avoidance [17]. As a result, there remains a need for a lightweight, camera-only framework that can provide both reliable navigation and safety in real-world assistive scenarios.

Traditional vision-based wheelchair navigation methods are predominantly grounded in visual servoing frameworks, where predefined geometric features are extracted from images and used to compute control commands [18,19,20,21]. In corridor-following tasks, features such as vanishing points and vanishing lines are commonly used because of their geometric relationship with structured environments. However, these methods depend critically on the accurate detection of such features. In practice, environmental noise, illumination variations, motion blur, and occlusions often degrade feature quality, leading to unstable or undefined control behavior when features are partially or completely lost.

To overcome these limitations, recent research has explored learning-based approaches for visual navigation. Convolutional neural networks (CNNs) have been used to learn end-to-end mappings from images to control commands, thereby bypassing explicit feature extraction [22,23]. In these approaches, perception and control are implicitly integrated within a single model, improving robustness in complex and dynamic environments. Reinforcement learning methods have also demonstrated the potential of learning control policies directly from visual input [24]. Nevertheless, purely learning-based systems often lack interpretability and may exhibit unpredictable behavior in safety-critical situations, particularly in the presence of obstacles or under distribution shifts.

Obstacle detection and avoidance constitute another essential component of autonomous wheelchair navigation. Existing solutions employ a variety of sensing modalities, including LiDAR, RGB-D cameras, and ultrasonic sensors [25,26,27]. While these sensors provide accurate spatial information, they significantly increase system cost and complexity. Vision-based alternatives, including classical image processing methods and deep learning-based detectors [28,29,30], offer a more practical solution. Recent advances in lightweight architectures, such as MobileNet-based detectors, enable real-time object detection on embedded platforms. However, integrating such perception modules with stable and reliable control strategies remains a key challenge.

In this work, we address these limitations by presenting a unified vision-only framework for autonomous wheelchair navigation that combines learning-based and model-based control under a common perspective. The proposed system integrates three main components. First, a CNN-based corridor-following module directly predicts angular velocity from monocular images, replacing traditional feature extraction and control pipelines. Second, a lightweight object detection model is deployed on a Raspberry Pi platform to identify obstacles in real time using a single camera. Third, an Image-Based Visual Servoing (IBVS) control law is formulated to generate obstacle avoidance maneuvers in the image space based on detected bounding-box features.

A hybrid control strategy is introduced to coordinate these components through two complementary modes: a nominal navigation mode driven by the CNN, and a safety-critical avoidance mode activated when obstacles are detected within a predefined region of interest. This design combines the robustness and adaptability of learning-based navigation with the stability and interpretability of analytical control. The entire system is designed for deployment on a low-cost embedded platform using a monocular camera, demonstrating that advanced autonomous navigation capabilities can be achieved without reliance on expensive sensing or computation resources.

Experimental validation on real-world datasets and a physical wheelchair platform demonstrates the effectiveness of the proposed approach. The CNN-based module achieves robust corridor following under noisy and previously unseen conditions, while the IBVS-based controller ensures reliable obstacle avoidance. The system operates in real time under resource constraints, highlighting its practical applicability. It is important to note that corridor-following experiments were conducted on a high-performance platform for model validation, while obstacle detection and avoidance were evaluated on an embedded Raspberry Pi platform to demonstrate edge feasibility.

Finally, this work shows that corridor following and obstacle avoidance can be interpreted within a unified visual servoing framework, where learning-based and analytical methods represent complementary realizations of the same control principle.

1.1. Contributions

To address these challenges, this paper makes the following contributions:

We propose a unified vision-based control framework that interprets corridor following as an implicit, learned visual servoing process and obstacle avoidance as an explicit IBVS task, demonstrating that learning-based and analytical control can be viewed as complementary realizations of the same image-based error regulation principle.
We formulate corridor following as an implicit, learned visual servoing process and obstacle avoidance as an explicit IBVS task, thereby highlighting the complementary roles of data-driven and analytical control within a single system.
Unlike earlier corridor-following works that rely on hand-crafted vanishing features, and unlike generic deep visual-servoing works that are not developed for wheelchair corridor navigation, the proposed method learns the steering command directly from corridor imagery and embeds it within a unified assistive navigation controller.
We design a hybrid control architecture with supervisory logic that prioritizes safety-critical obstacle avoidance while preserving stable corridor-following behavior, highlighting how learning-based robustness to visual uncertainty and model-based stability in safety-critical situations jointly enable reliable autonomous navigation.
We demonstrate the feasibility of deploying the complete perception–control pipeline on a low-cost embedded platform using a monocular camera, a compact ResNet-18 corridor-following backbone, and a MobileNetV2-SSD obstacle detector, enabling practical assistive autonomy without expensive sensors.
We provide experimental evaluation of both modules and discuss their integrated behavior under noisy conditions, previously unseen environments, real-world deployment constraints, and current limitations requiring future quantitative study.

1.2. Organization of the Paper

The remainder of this paper is organized as follows. Section 2 reviews related work in visual servoing, deep learning-based navigation, obstacle detection and avoidance, and embedded vision for assistive robotics. Section 3 presents the overall system architecture. Section 4 introduces the unified visual servoing perspective adopted in this work. Section 5 describes the CNN-based corridor-following module. Section 6 presents the obstacle detection and image-based avoidance module. Section 7 details the control arbitration and integration strategy. Section 8 describes the hardware platform and edge implementation. Section 9 reports the experimental results. Section 10 discusses the main findings, limitations, and future directions. Finally, Section 11 concludes the paper.

2. Related Work

2.1. Visual Servoing for Corridor Following

Corridor following has been extensively studied within the framework of image-based visual servoing (IBVS), where visual features extracted from images are used to guide robot motion. Early works rely on geometric features such as vanishing points and vanishing lines to represent the structure of corridor environments [18,19,20,21]. These features are typically derived from line segments and incorporated into control laws to generate corrective angular velocities that align the robot with the corridor centerline.

In [18], a vanishing point-based approach is proposed for autonomous wheelchair navigation, where both the horizontal position of the vanishing point and the orientation of the vanishing line are used as control features. This approach is further extended in [19] to handle more complex navigation tasks such as doorway traversal. Related formulations in [20,21] demonstrate the effectiveness of IBVS for mobile robot navigation in structured environments.

Despite their theoretical grounding, these methods are highly dependent on reliable feature extraction. In real-world conditions, factors such as illumination changes, motion blur, occlusions, and texture variability often degrade feature quality. As a result, the control signal may become unstable or undefined when the required features are partially or completely lost. These limitations motivate the need for approaches that are less reliant on explicit geometric feature engineering.

2.2. Learning-Based Visual Navigation

To overcome the limitations of traditional visual servoing, recent research has explored learning-based approaches that directly map visual input to control commands. Convolutional neural networks (CNNs) have been used to learn task-relevant representations from data, eliminating the need for handcrafted features.

Several works adopt a hybrid strategy where deep learning is used for perception while control remains model-based. For instance, ref. [22] employs a FlowNet-based architecture to estimate relative pose differences between images, which are then used within a control loop. Similarly, ref. [23] uses AlexNet to regress pose differences, followed by a classical control law. While effective, these approaches still separate perception and control.

More recent works move toward end-to-end learning, where the network directly predicts control commands from raw images. Reinforcement learning approaches, such as [24], further demonstrate the ability to learn navigation policies that are robust to noise and environmental variability. These methods improve robustness compared to traditional approaches but often lack interpretability and may exhibit unpredictable behavior in safety-critical situations, particularly when encountering obstacles or operating under distribution shifts. In this sense, CNN-based prediction of a steering-related quantity from images is not entirely new. The distinction of the present work is that it addresses monocular corridor following for an assistive wheelchair, uses corridor-geometry-derived supervisory labels, and then combines the learned steering policy with an explicit IBVS obstacle-avoidance controller in a single framework.

2.3. Vision-Based Obstacle Detection and Avoidance

Obstacle avoidance is a critical component of autonomous wheelchair navigation. Many existing systems rely on multi-sensor configurations, including LiDAR, RGB-D cameras, radar, and ultrasonic sensors [25,26,27], to obtain accurate spatial information. While effective, these solutions increase system cost and complexity, limiting their applicability in assistive contexts.

Vision-based approaches provide a more cost-effective alternative. Classical methods based on image processing techniques have been explored [28,29,30], but their performance is often constrained by the robustness of feature extraction. More recently, deep learning-based object detection models have demonstrated strong performance in identifying obstacles from monocular images. Lightweight architectures, such as MobileNetV2 combined with SSD detectors, enable real-time inference on embedded platforms.

However, most existing works treat obstacle detection as an isolated perception task and do not integrate it with a principled control framework. As a result, the transition from perception to safe and reliable motion control remains a key challenge.

2.4. Edge AI for Assistive Robotics

The emergence of edge AI has enabled the deployment of intelligent perception and control systems on low-cost embedded platforms. Hardware such as the Raspberry Pi, combined with efficient deep learning architectures, allows real-time processing without reliance on high-performance computing resources.

Lightweight models such as MobileNetV2 [31], which utilize depth-wise separable convolutions, are particularly well suited for resource-constrained environments. When integrated with single-shot detection frameworks such as SSD [32], they provide a practical solution for real-time object detection on embedded systems. Platforms such as Edge Impulse further facilitate end-to-end development pipelines for embedded machine learning applications.

Despite these advances, achieving reliable system-level performance remains challenging due to trade-offs between accuracy, latency, and computational cost. In particular, integrating multiple modules—such as navigation and obstacle avoidance—within a single embedded system requires careful architectural design.

2.5. Research Gap and Position of This Work

Based on the above discussion, related studies from the literature are summarized in Table 1. Existing approaches address individual aspects of autonomous wheelchair navigation, but a unified and practical solution is still lacking. Traditional visual servoing methods provide interpretable and theoretically grounded control but suffer from limited robustness due to their dependence on explicit feature extraction. Learning-based approaches improve robustness but may lack reliability in safety-critical scenarios. Similarly, obstacle detection methods have advanced significantly, yet their integration with stable control strategies remains limited.

This work addresses these limitations by proposing a unified vision-based framework that combines learning-based and analytical control under a common visual servoing perspective. Specifically, corridor following is realized as an implicit, learned visual servoing process, while obstacle avoidance is formulated as an explicit IBVS control task. A hybrid control architecture coordinates these components through supervisory switching, enabling reliable navigation and safe operation.

By integrating perception and control within a vision-only system and deploying it on a low-cost embedded platform, this work bridges the gap between robustness, safety, and practical implementation in assistive robotics. Unlike prior works that treat these components independently, this paper integrates them within a unified control framework.

3. System Overview

This section presents the overall architecture of the proposed unified vision-based autonomous wheelchair system. The framework integrates deep learning-based corridor following with image-based obstacle detection and avoidance within a single perception–control loop, operating entirely on monocular visual input.

Figure 1 illustrates the proposed system. A forward-facing monocular camera continuously captures the scene in front of the wheelchair. The captured image is simultaneously processed by two parallel modules: a CNN-based corridor-following module and a lightweight obstacle detection module. The outputs of these modules are coordinated by a supervisory logic, which determines whether the system should follow the corridor or execute an avoidance maneuver using an image-based visual servoing (IBVS) controller.

3.1. Overall Pipeline

At each time step, an image $I_{t}$ acquired from the camera is processed through the following pipeline:

Corridor-Following (CNN): The image is passed through a trained convolutional neural network that directly predicts an angular velocity command $ω_{cnn}$ , enabling the wheelchair to align with the corridor centerline without explicit feature extraction.
Obstacle Detection: In parallel, the same image is processed by a lightweight object detection model to identify obstacles. The detector outputs bounding boxes, from which the center $(u, v)$ of the most relevant obstacle is extracted.
IBVS-Based Avoidance: When an obstacle is detected within a predefined area of interest (AoI), an IBVS controller computes avoidance velocities $(ν_{ibvs}, ω_{ibvs})$ based on the image-space error between the current feature location $(u, v)$ and a desired safe location $(u^{*}, v^{*})$ .
Decision and Control Fusion: A supervisory logic selects or prioritizes the appropriate control command depending on the presence of obstacles.
Actuation: The final control command is converted into motor signals and applied to the wheelchair platform.

This pipeline enables continuous perception and real-time control, allowing the system to respond dynamically to both navigation and safety requirements.

3.2. Operational Modes

The system operates under two main modes, governed by the supervisory logic:

Corridor-Following Mode: When no obstacle is detected within the AoI, the system relies on the CNN output $ω_{cnn}$ to perform nominal navigation. A constant forward velocity $ν^{*}$ is maintained, and the wheelchair follows the corridor trajectory.
Obstacle-Avoidance Mode: When an obstacle enters the AoI, the system switches to IBVS-based control. In this mode, the control objective is to drive the obstacle’s image feature $(u, v)$ toward a predefined safety margin $(u^{*}, v^{*})$ , effectively steering the wheelchair away from the obstacle.

3.3. Control Switching Strategy

The supervisory logic determines the final control command applied to the wheelchair as:

(ν, ω) = \begin{cases} (ν^{*}, ω_{cnn}), & \text{if no obstacle is detected in the AoI} \\ (ν_{ibvs}, ω_{ibvs}), & \text{if an obstacle is detected in the AoI} \end{cases}

(1)

This switching mechanism ensures that safety-critical obstacle avoidance takes precedence over nominal navigation while preserving stable corridor-following behavior in obstacle-free conditions.
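The switching rule of Equation (1) can be sketched in a few lines of Python. This is only an illustrative sketch: the names `Command` and `select_command`, and the convention of passing `None` when no obstacle lies inside the AoI, are hypothetical and are not taken from the implementation described in this paper.

```python
from typing import NamedTuple, Optional


class Command(NamedTuple):
    """Velocity command: linear velocity v and angular velocity omega."""
    v: float
    omega: float


def select_command(v_star: float,
                   omega_cnn: float,
                   ibvs_command: Optional[Command]) -> Command:
    """Apply the switching rule of Equation (1).

    ibvs_command is None when no obstacle lies inside the AoI (assumed
    convention); otherwise it holds (v_ibvs, omega_ibvs) from the
    avoidance controller.
    """
    if ibvs_command is None:
        # Nominal mode: constant forward speed with CNN steering.
        return Command(v_star, omega_cnn)
    # Avoidance mode: the IBVS command takes precedence.
    return ibvs_command
```

Keeping the rule in a single pure function makes the priority of the avoidance command easy to inspect and to test in isolation.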

3.4. Image-Plane Representation and Area of Interest

The obstacle avoidance process is formulated in the image plane. A rectangular region in front of the wheelchair is defined as the area of interest in the physical world. Due to perspective projection, this region appears in the image as a trapezoidal shape with a wider base at the bottom and a narrower top.

Only obstacles whose bounding boxes intersect this AoI are considered for avoidance. The center of the detected bounding box $(u, v)$ is used as the visual feature, and a desired feature location $(u^{*}, v^{*})$ is defined near the image boundaries corresponding to a safety margin. The IBVS controller regulates the error between these two points to generate avoidance motion.
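One way to test whether a bounding box intersects the trapezoidal AoI is a separating-axis check between two convex polygons. The sketch below is a minimal illustration under stated assumptions: the function names, the `(x_min, y_min, x_max, y_max)` box layout, and the trapezoid vertices for a 640 × 480 image are all hypothetical, since the paper does not give the AoI coordinates or the intersection test actually used.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def _project(polygon: Sequence[Point], axis: Point) -> Tuple[float, float]:
    """Project all vertices onto an axis and return the (min, max) extent."""
    dots = [x * axis[0] + y * axis[1] for x, y in polygon]
    return min(dots), max(dots)


def convex_polygons_intersect(a: Sequence[Point], b: Sequence[Point]) -> bool:
    """Separating-axis test for two convex polygons (touching counts as hit)."""
    for polygon in (a, b):
        n = len(polygon)
        for i in range(n):
            x1, y1 = polygon[i]
            x2, y2 = polygon[(i + 1) % n]
            axis = (y1 - y2, x2 - x1)  # normal of the current edge
            min_a, max_a = _project(a, axis)
            min_b, max_b = _project(b, axis)
            if max_a < min_b or max_b < min_a:
                return False  # a separating axis exists
    return True


def bbox_intersects_aoi(bbox: BBox, aoi: Sequence[Point]) -> bool:
    """Return True if the axis-aligned box overlaps the convex AoI polygon."""
    x_min, y_min, x_max, y_max = bbox
    box: List[Point] = [(x_min, y_min), (x_max, y_min),
                        (x_max, y_max), (x_min, y_max)]
    return convex_polygons_intersect(box, aoi)


# Hypothetical AoI for a 640 x 480 image: narrow top edge, wide bottom edge.
AOI_TRAPEZOID: List[Point] = [(240, 300), (400, 300), (540, 480), (100, 480)]
```

With these illustrative vertices, a box that lies beside the slanted left edge is rejected even though its axis-aligned extent overlaps that of the AoI, which is why a plain rectangle overlap test would not be sufficient for a trapezoidal region.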

3.5. Main Assumptions

The proposed system is designed under the following assumptions:

A single forward-facing monocular camera is used as the only sensing modality.
At most one dominant obstacle is considered within the AoI at any given time.
Corridor following represents the nominal navigation task in structured environments.
The wheelchair operates at low speeds, suitable for assistive applications, ensuring safe and stable control.

3.6. Design Rationale

The architecture combines the complementary strengths of learning-based and model-based approaches. The CNN-based module provides robustness to noise, occlusions, and feature absence by learning navigation behavior directly from data. In contrast, the IBVS-based controller offers a mathematically grounded and interpretable mechanism for obstacle avoidance, ensuring predictable responses in safety-critical situations.

The supervisory logic enables seamless integration of these components, allowing the system to dynamically switch between navigation and avoidance. This unified design provides a practical and scalable solution for autonomous wheelchair navigation, particularly under the constraints of low-cost embedded deployment.

4. A Unified Visual Servoing Framework

This section presents a unified interpretation of the proposed navigation system by formulating both corridor following and obstacle avoidance within a common visual servoing framework. Although these tasks are traditionally addressed using different methodologies, we show that they can be expressed as complementary instances of image-based control.

4.1. General Visual Servoing Formulation

Visual servoing aims to regulate a set of visual features $s$ extracted from the image toward a desired configuration $s^{*}$. The control objective is defined through the image-space error:

e = s - s^{*}

(2)

A standard IBVS control strategy enforces exponential convergence of this error:

\dot{e} = - λ e

(3)

where $λ > 0$ is a control gain. The relationship between the feature velocity $\dot{s}$ and the robot velocity $\dot{r}$ is given by:

\dot{s} = J \dot{r}

(4)

where J is the image Jacobian. This formulation provides a general framework for deriving control laws directly in the image space.
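For completeness, Equations (2)–(4) can be combined into the usual velocity command. The short derivation below assumes that the desired feature $s^{*}$ is constant and that $J^{+}$ denotes the Moore–Penrose pseudo-inverse of $J$. Both are standard assumptions in the IBVS literature rather than statements made explicitly in this section.

```latex
\dot{e} = \dot{s} - \dot{s}^{*} = \dot{s} = J \dot{r}
\quad \Rightarrow \quad
J \dot{r} = - λ e
\quad \Rightarrow \quad
\dot{r} = - λ J^{+} e
```

The first step uses Equations (2) and (4), the second imposes the desired error dynamics of Equation (3), and the third solves for the robot velocity in the least-squares sense.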

4.2. Obstacle Avoidance as Explicit Visual Servoing

The obstacle avoidance component of the proposed system directly follows the IBVS paradigm. The visual feature is defined as the center of the detected bounding box:

s = \begin{bmatrix} u \\ v \end{bmatrix}

(5)

The desired feature location $s^{*} = (u^{*}, v^{*})$ corresponds to a safe region near the image boundary, defined according to a safety margin.

The resulting error:

e = \begin{bmatrix} u - u^{*} \\ v - v^{*} \end{bmatrix}

(6)

is regulated using the IBVS control law presented in Section 6. This yields analytically derived control commands that guarantee predictable and stable avoidance behavior.
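The feature and error of Equations (5) and (6) can be computed directly from a detector output, as in the minimal sketch below. The function name `feature_error` and the `(x_min, y_min, x_max, y_max)` box layout are assumptions made for illustration; the detector's actual output format is not specified here.

```python
from typing import Tuple


def feature_error(bbox: Tuple[float, float, float, float],
                  desired: Tuple[float, float]) -> Tuple[float, float]:
    """Return e = (u - u*, v - v*) for a box (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = bbox
    # Visual feature s = (u, v): center of the bounding box, Equation (5).
    u = 0.5 * (x_min + x_max)
    v = 0.5 * (y_min + y_max)
    u_star, v_star = desired
    # Image-space error, Equation (6).
    return (u - u_star, v - v_star)
```

For example, a box spanning (100, 200) to (300, 400) has its center at (200, 300). With a hypothetical desired location of (600, 300), the error is (-400, 0), so only the horizontal coordinate needs to be regulated.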

4.3. Corridor Following as Learned Visual Servoing

In contrast to obstacle avoidance, corridor following is implemented using a convolutional neural network that directly predicts the angular velocity from the input image.

ω_{cnn} = F (I; Θ)

(7)

where $I$ denotes the input image and $F(\cdot)$ represents a CNN parameterized by $Θ$. This formulation enables a direct mapping from visual observations to control outputs, removing the dependency on intermediate feature extraction.

Although no explicit visual features are computed during inference, the CNN implicitly learns to extract task-relevant representations, such as corridor orientation and alignment. This can be interpreted as a learned approximation of the visual servoing process, where both feature extraction and control law are embedded within the network. From this perspective, the CNN replaces the explicit computation of $s$, $s^{*}$, and $J$, and directly estimates the control action required to minimize an implicit visual error.

4.4. Unified Interpretation

Both modules can therefore be viewed as instances of visual servoing:

Corridor following: implicit (learning-based) visual servoing;
Obstacle avoidance: explicit (model-based) visual servoing.

This unified perspective reveals that learning-based and analytical approaches are not fundamentally different, but rather complementary realizations of the same control principle. The CNN provides robustness to noise, occlusions, and feature ambiguity, while the IBVS controller offers interpretability and stability in safety-critical situations. This unified perspective distinguishes our work from prior studies that treat learning-based and analytical methods as separate paradigms, as we explicitly show that both can be derived from the same image-based error regulation framework.

4.5. Implications for System Design

The hybrid architecture adopted in this work naturally follows from this unified formulation. Instead of relying on a single control paradigm, the system combines:

A learned controller for global navigation under uncertainty;
A model-based controller for reliable local safety enforcement.

The supervisory switching strategy described in Section 7 enables seamless coordination between these components, ensuring that obstacle avoidance takes precedence when necessary while maintaining stable corridor-following behavior otherwise.

This formulation provides a principled foundation for integrating learning-based perception and control-theoretic methods in assistive robotic systems, particularly under the constraints of vision-only sensing and edge deployment. This unified formulation provides a conceptual bridge between data-driven and model-based control paradigms.

5. Deep CNN-Based Corridor Visual Navigation

This section describes the proposed deep learning-based framework for autonomous corridor following. The approach reformulates the conventional visual servoing pipeline into an end-to-end learning paradigm, where both perception and control are jointly modeled. Instead of relying on explicit feature extraction followed by a control law, a convolutional neural network (CNN) is trained to directly predict the steering command from raw visual input.

5.1. Problem Formulation

The wheelchair is modeled as a nonholonomic system with two control variables: translational velocity $ν$ and angular velocity $ω$. For assistive applications, the forward velocity $ν^{*}$ is maintained at a constant low value, and the control problem reduces to estimating the angular velocity required to align the wheelchair with the corridor.
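To make the role of the two control variables concrete, the sketch below integrates a unicycle model, a common kinematic abstraction for nonholonomic wheeled platforms. The unicycle model, the forward-Euler integration, and the name `unicycle_step` are assumptions made for illustration; the paper does not state the exact kinematic equations it uses.

```python
import math
from typing import Tuple

Pose = Tuple[float, float, float]  # (x, y, heading theta in radians)


def unicycle_step(pose: Pose, v: float, omega: float, dt: float) -> Pose:
    """Advance the pose by one forward-Euler step of the unicycle model."""
    x, y, theta = pose
    # The platform can only translate along its current heading
    # (nonholonomic constraint), so v acts along (cos theta, sin theta).
    x_next = x + v * math.cos(theta) * dt
    y_next = y + v * math.sin(theta) * dt
    # The angular velocity changes the heading directly.
    theta_next = theta + omega * dt
    return (x_next, y_next, theta_next)
```

With the forward velocity fixed, the heading (and hence the alignment with the corridor) is influenced only through `omega`, which is why the learning problem reduces to predicting a single scalar.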

In classical visual servoing, this alignment is achieved by extracting geometric features and computing control commands through a predefined model. In contrast, as depicted in Figure 2, the proposed method formulates the task as a regression problem. Equation (7) represents the learning of this regression model.

5.2. Ground-Truth Generation via Geometric Features

Although the model does not rely on explicit features during inference, geometric cues are utilized during training to generate reliable supervisory signals. As shown in Figure 3, the vanishing point coordinate $x_{v}$ and vanishing line orientation $θ_{v}$ are used to represent corridor geometry.

The desired configuration corresponds to $(x_{v}, θ_{v}) = (0, 0)$, indicating that the wheelchair is centered and aligned with the corridor. Based on these features, the ground-truth angular velocity is computed using a control law derived from classical visual servoing:

ω = f (x_{v}, θ_{v}, ν^{*}, λ, \dots)

(8)

To ensure high-quality supervision, human annotation is employed to correct feature estimates in challenging cases where automated extraction is unreliable due to noise, occlusion, or illumination changes.

The system assumes a four-wheeled wheelchair with two passive front wheels and two actuated rear wheels, resulting in a nonholonomic motion model. Since forward velocity is kept constant, angular velocity becomes the primary control variable for maintaining alignment.

In traditional approaches, vanishing features are extracted using algorithms such as LSD-based line detection. However, these methods are sensitive to environmental variability and parameter tuning. To address this limitation, manually verified annotations are used to ensure consistency in the generated training labels.

The control law used for generating ground-truth angular velocity is given by:

ω = - J_{w}^{+} (λ e + J_{ν} ν^{*})

(9)

where the Jacobians are defined as:

J_{w} = \begin{bmatrix} 1 + x_{v}^{2} \\ - λ_{θ_{v}} l c + λ_{θ_{v}} w ρ + ρ s \end{bmatrix}

(10)

J_{ν} = \begin{bmatrix} 0 \\ - λ_{θ_{v}} ρ \end{bmatrix}

(11)

and the error is defined as:

e = (x_{v}, θ_{v}) - (0, 0)

(12)

The resulting $ω$ serves as the target value for training the CNN.
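Because $J_{w}$ in Equation (10) is a 2 × 1 column, its pseudo-inverse reduces to $J_{w}^{T} / (J_{w}^{T} J_{w})$, and Equation (9) yields a scalar. The sketch below evaluates that expression numerically. It is a hedged illustration: the function name `ground_truth_omega` is hypothetical, and the Jacobian entries are passed in as plain numbers because the geometric quantities that appear in Equations (10) and (11) are not defined in this section.

```python
from typing import Sequence


def ground_truth_omega(j_w: Sequence[float],
                       j_v: Sequence[float],
                       error: Sequence[float],
                       gain: float,
                       v_star: float) -> float:
    """Evaluate omega = -J_w^+ (gain * e + J_v * v_star), Equation (9).

    j_w and j_v are the two entries of the column Jacobians, error is
    e = (x_v, theta_v) from Equation (12), and gain is the scalar lambda.
    """
    # Term in parentheses of Equation (9), computed entry by entry.
    rhs = [gain * e_i + jv_i * v_star for e_i, jv_i in zip(error, j_v)]
    # Pseudo-inverse of a column vector: J^T / (J^T J).
    norm_sq = sum(j * j for j in j_w)
    if norm_sq == 0.0:
        raise ValueError("J_w is zero; the pseudo-inverse is undefined")
    return -sum(j * r for j, r in zip(j_w, rhs)) / norm_sq
```

At the desired configuration, and whenever the $J_{ν} ν^{*}$ term also vanishes, the function returns zero, which matches the intuition that a centered and aligned wheelchair needs no steering correction.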

5.3. Corridor Dataset Construction

A comprehensive dataset is constructed by combining publicly available corridor images with data collected from real-world environments. The initial dataset contains 3563 images.

Images for which reliable ground-truth labels cannot be computed, primarily because the vanishing point falls outside the image frame, are excluded. This results in a clean subset used for training. To improve robustness and generalization, synthetic noise is introduced through data augmentation, including Gaussian blur, motion blur, and JPEG compression artifacts. The final dataset consists of 6320 images, combining both clean and augmented samples.

The dataset is split into training and testing sets in a 90:10 ratio, with a portion of the training set reserved for validation. Figure 4 shows clean and augmented samples from the dataset.

5.4. Network Architecture and Training

A ResNet-18 architecture is adopted due to its balance between representational capacity and computational efficiency. The proposed corridor-following network therefore uses the standard ResNet-18 backbone as a compact baseline, initialized with ImageNet-pretrained weights and fine-tuned for the present regression task rather than redesigned from scratch.

All input images are resized to $224 \times 224$. The final fully connected layer is replaced with a single linear neuron that outputs the predicted angular velocity $ω$. Accordingly, the task-specific modification relative to the original ResNet-18 lies in the regression head and the fine-tuning objective. This choice reduces implementation complexity while preserving sufficient representation capacity for corridor imagery.

Remark 1.

It is important to clarify that the proposed corridor-following model is not a novel CNN architecture but rather a task-specific adaptation of the standard ResNet-18. The only modifications are: (1) replacing the final 1000-way fully connected classification layer with a single linear neuron for angular velocity regression, and (2) fine-tuning all convolutional and batch normalization layers end-to-end from ImageNet-pretrained weights. No architectural changes such as adjusting the number of residual blocks, modifying kernel sizes, or adding attention mechanisms were made, as ResNet-18 already provides an appropriate balance between representational capacity and computational efficiency for our embedded target platform.

Training is performed using a mean squared error loss:

loss = \frac{1}{n} \sum_{i = 1}^{n} {(\hat{ω}_{i} - ω_{i})}^{2}

(13)

The model is trained using stochastic gradient descent with momentum. The learning rate is set to 0.005, momentum to 0.9, and weight decay to 0.005. Training is conducted over 40 epochs with a batch size of 8. The lightweight character of the corridor-following module therefore comes from three practical design choices: a compact 18-layer residual backbone, low-resolution monocular input, and a scalar regression output instead of a denser multi-branch prediction pipeline.

5.5. Evaluation Metrics

Model performance is evaluated using regression-based metrics. The coefficient of determination ($R^{2}$) is used as the primary metric:

R^{2} (ω, \hat{ω}) = 1 - \frac{\sum_{i = 1}^{n} {(ω_{i} - \hat{ω}_{i})}^{2}}{\sum_{i = 1}^{n} {(ω_{i} - \bar{ω})}^{2}}

(14)

where $\bar{ω}$ is the mean of the target values. An $R^{2}$ value close to 1 indicates strong agreement between predicted and target values.

In addition to $R^{2}$, robustness is assessed under noisy and unreliable conditions, demonstrating the ability of the CNN to maintain stable performance even when traditional feature-based methods fail.
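As a concrete reading of the coefficient of determination in Equation (14), the following pure-Python sketch computes it as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r2_score` and the error handling are illustrative choices rather than part of the original evaluation code.

```python
def r2_score(targets, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("targets and predictions must be non-empty and equal length")
    mean_target = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    ss_tot = sum((t - mean_target) ** 2 for t in targets)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot
```

A perfect predictor gives 1, while always predicting the mean of the targets gives 0, which is why values close to 1 indicate strong agreement.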

5.6. Comparing Deep and Vanishing Features

Failures of the vanishing-feature-based method in generating reliable $ω$ typically stem from inaccuracies in the feature extraction stage. In contrast, the proposed CNN-based approach does not rely on explicit geometric feature detection and is therefore capable of producing meaningful estimates of $ω$ even under such failure conditions. To further analyze the behavior of the CNN model, we infer the corresponding deep feature $x_{v}$ from the predicted $ω$ and compare it with the vanishing point feature $x_{v}$ obtained from the classical method using ground-truth data.

Recall the control law from [33] presented in Section 5.2:

ω = - J_{w}^{+} (λ e + J_{ν} ν^{*})

(15)

Here, the Jacobians $J_{w}$ and $J_{ν}$ and the error $e$ are given in Equations (10), (11), and (12), respectively. After substituting these values and the constants defined in the model, we obtain:

\begin{matrix} ω = [\begin{matrix} - 1 - x_{v}^{2} & - ρ sin θ_{v} \end{matrix}] [\begin{matrix} λ x_{v} \\ λ θ_{v} \end{matrix}] + [\begin{matrix} - 1 - x_{v}^{2} & - ρ sin θ_{v} \end{matrix}] [\begin{matrix} 0 \\ - ρ cos θ_{v} ν^{*} / h \end{matrix}] \end{matrix}

(16)

This equation can be reduced to the following:

ω = - λ x_{v} - λ x_{v}^{3} - ρ λ θ_{v} sin θ_{v} + \frac{ρ^{2}}{h} sin θ_{v} cos θ_{v} ν^{*}

(17)

At this stage, the system is described by a single equation involving three unknowns. Consequently, given $ω$, it is not possible to derive a closed-form solution for $x_{v}$, $y_{v}$, and $θ_{v}$. Nevertheless, by constraining the feasible ranges of these variables according to the characteristics of our task, an approximate solution can be obtained. In particular, this allows us to estimate a deep feature corresponding to the vanishing point coordinate $x_{v}$ from the predicted $ω$. It is important to emphasize that this estimation is performed solely for the purpose of comparing the behavior of the proposed network with the TVS-based approach.

Since $x_{v}$ and $y_{v}$ are expressed in meters within the image plane, it is reasonable to assume that their magnitudes remain bounded by $1$ m. As a result, the quantity $ρ = \sqrt{x_{v}^{2} + y_{v}^{2}}$ is also bounded by 1. Under this assumption, the last term in Equation (17) is much smaller than the dominant terms, which are multiplied by the relatively large constant $λ$, and it is neglected only for the purpose of obtaining an approximate inverse map from $ω$ to $x_{v}$.

From [18], we can also conclude that $θ_{v} \in (- \frac{π}{2}, \frac{π}{2})$. However, in our experiments, we observed that for most images in the dataset, $θ_{v} \in (- \frac{π}{6}, \frac{π}{6})$. For this reason, the third term in Equation (17) is also neglected in the same approximate analysis. This simplification is intended only to interpret the learned controller and should not be confused with the exact control law used to generate the training labels. We can then solve the following equation for $x_{v}$:

ω = - λ x_{v} - λ x_{v}^{3}

(18)

As this is a cubic equation of the form $x_{v}^{3} + x_{v} + ω / λ = 0$, its Cardano discriminant is $Δ = {(ω / 2 λ)}^{2} + 1 / 27 > 0$. Therefore, the equation admits a unique real root, which can be expressed in terms of $ω$ as:

\begin{matrix} x_{v} = \sqrt[3]{\frac{- ω}{2 λ} + \sqrt{(\frac{ω}{2 λ})^{2} + \frac{1}{27}}} + \sqrt[3]{\frac{- ω}{2 λ} - \sqrt{(\frac{ω}{2 λ})^{2} + \frac{1}{27}}} \end{matrix}

(19)

This expression is obtained directly from Cardano's solution for the depressed cubic $x_{v}^{3} + x_{v} + ω / λ = 0$, where the discriminant is positive for every real value of $ω$, and in particular for the values encountered in our experiments, guaranteeing a single real root.
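The approximate inverse map from $ω$ to $x_{v}$ can be checked numerically with a short sketch. The code below evaluates the reduced law of Equation (18) and its Cardano inverse from Equation (19), then confirms that the two are mutual inverses. The function names are illustrative, and the value $λ = 2700$ used in the usage note is only the approximate magnitude quoted in Remark 2.

```python
import math


def signed_cbrt(value: float) -> float:
    """Real cube root that also handles negative arguments."""
    return math.copysign(abs(value) ** (1.0 / 3.0), value)


def omega_from_xv(x_v: float, lam: float) -> float:
    """Reduced control law of Equation (18): omega = -lam*x_v - lam*x_v**3."""
    return -lam * x_v - lam * x_v ** 3


def xv_from_omega(omega: float, lam: float) -> float:
    """Unique real root of x**3 + x + omega/lam = 0 (Equation (19))."""
    half_q = omega / (2.0 * lam)
    root = math.sqrt(half_q ** 2 + 1.0 / 27.0)  # square root of the discriminant
    return signed_cbrt(-half_q + root) + signed_cbrt(-half_q - root)
```

For example, `xv_from_omega(omega_from_xv(0.3, 2700.0), 2700.0)` recovers 0.3 up to floating-point error. Because the reduced law is strictly decreasing in $x_{v}$, a positive $ω$ always maps to a negative $x_{v}$ and vice versa.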

The feature $x_{v}$ in this context represents a deep feature corresponding to the vanishing point coordinate, inferred from the angular velocity $ω$ predicted by the CNN model. This estimate is compared with the corresponding $x_{v}$ obtained through the traditional TVS-based feature extraction method, using a common ground truth for evaluation. Figure 5 illustrates the overall comparison framework. For a given input image, three quantities are compared: the vanishing point predicted by inverting the learned control law, which represents the CNN-based method and is denoted $x_{v}^{cnn}$; the ground-truth reference, denoted $x_{v}^{gt}$; and the vanishing point feature extracted using a geometric TVS-based method, denoted $x_{v}^{geom}$. The numerical results of this comparison are shown in Section 9.

Remark 2.

To simplify Equation (17) to a tractable form, we examine the magnitude of each term given the characteristics of our dataset and operating conditions. The third term, $- ρ λ θ_{v} \sin θ_{v}$, depends on the vanishing angle $θ_{v}$. In our corridor dataset, $θ_{v}$ typically lies within $(- 0.2, 0.2)$ rad for reasonably aligned images, and $| \sin θ_{v} | \leq 0.2$. With $ρ \leq 1$ m and $λ \approx 2700$, this term is at most $1 \times 2700 \times 0.2 \times 0.2 = 108$. In contrast, the first term $- λ x_{v}$ already has magnitude $2700 \times 0.05 = 135$ for a small misalignment of $x_{v} = 0.05$, and grows to over 1000 for larger $x_{v}$. Thus the third term is comparable only when $x_{v}$ is very small, but in that case the cubic term $- λ x_{v}^{3}$ is negligible and the overall $ω$ remains small, so errors in this regime do not significantly affect control. The fourth term, $\frac{ρ^{2}}{h} \sin θ_{v} \cos θ_{v} ν^{*}$, contains the small forward velocity $ν^{*} = 0.2$ m/s and the camera height constant $h$. Its maximum magnitude is less than 0.01, which is negligible compared to the first term (typically >100). Therefore, with negligible loss of accuracy for control purposes, Equation (17) reduces to the cubic form shown in Equation (18).

5.7. Evaluation on Non-Aligned Corridor Images

During training, robustness to noise was incorporated by augmenting the dataset with corrupted samples. However, a subset of images (Figure 4c) was excluded from training due to the inability of traditional methods to provide reliable ground-truth annotations. Despite this limitation, the trained CNN model is capable of producing meaningful approximations of the angular velocity $ω$ for such unreliable inputs.

Although these images are unsuitable for conventional feature-based evaluation, a human observer can often infer the appropriate steering direction (left or right) required to initiate corridor following. Moreover, the predicted direction can be directly determined from the sign of the estimated $ω$. This observation motivates a partial validation strategy based on human annotation.

For each unreliable image, the trained model first produces an estimate of $ω$, which is converted into a binary decision (left or right) based on its sign. The same image is then presented to a human annotator, who provides a corresponding directional label. The agreement between the model prediction and the human annotation is used to compute evaluation metrics. This process is repeated across the dataset, and each annotator evaluates all samples three times, with the final results obtained by averaging the scores.

The accuracy score measures the proportion of samples for which the predicted direction matches the human annotation, reflecting the model’s ability to infer correct motion direction under unreliable conditions:

Accuracy Score = \frac{n_{i}}{n} \times 100

(20)

where $n_{i}$ denotes the number of correctly predicted samples and $n$ is the total number of unreliable images.

The false positive score captures the severity of incorrect predictions by averaging the magnitude of $ω$ over samples where the predicted direction is wrong:

False Positive Score = \frac{\sum_{j} | ω_{j} |}{n_{j}}

(21)

where $n_{j}$ is the number of misclassified samples and $ω_{j}$ represents the corresponding angular velocity values.

For reliable autonomous navigation under such challenging conditions, an ideal system should achieve a high accuracy score while maintaining a low false positive score, indicating both correct directional inference and limited deviation in erroneous cases.
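The two scores can be illustrated with a small sketch. It assumes that each human label is encoded as the sign ($+1$ or $-1$) that the annotator expects $ω$ to have, and that a prediction of exactly zero counts as incorrect; both conventions, like the function name `directional_scores`, are assumptions made for illustration.

```python
def directional_scores(omegas, human_signs):
    """Return (accuracy score in percent, false positive score).

    omegas: predicted angular velocities for the unreliable images.
    human_signs: +1 or -1 per image, the sign of omega expected by the annotator.
    """
    if len(omegas) != len(human_signs) or not omegas:
        raise ValueError("inputs must be non-empty and of equal length")
    # Magnitudes of omega for samples whose predicted direction is wrong.
    wrong = [abs(w) for w, s in zip(omegas, human_signs) if w * s <= 0]
    n = len(omegas)
    accuracy = (n - len(wrong)) / n * 100.0
    false_positive = sum(wrong) / len(wrong) if wrong else 0.0
    return accuracy, false_positive
```

With predictions `[0.4, -0.2, 0.1, -0.3]` and labels `[1, -1, -1, 1]`, two of the four directions agree, so the accuracy score is 50 and the false positive score is the mean magnitude of the two wrong predictions, 0.2.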

6. Obstacle Detection and Image-Based Avoidance

This section describes the obstacle handling component of the proposed system, which combines lightweight visual perception with an image-based control strategy to enable reactive and safe navigation. The primary objective is not merely to detect obstacles, but to translate visual observations into control actions that guide the wheelchair away from potential collisions.

Figure 6 illustrates the overall process. The image acquired by the vision sensor is fed as input to the obstacle detection module. The output of this module is a bounding box that defines the location of the obstacle in the image. The obstacle coordinates are then sent to the avoidance module, which takes the required avoidance action.

The method proposed in this section assumes the following setting. The wheelchair navigates in an outdoor environment that is expected to contain a few potential obstacles. Before an obstacle is detected and the avoidance module is activated, the wheelchair is guided to achieve the main navigation task. The control design and motion planning used to define the main task are outside the scope of this work. To simplify the presentation, we assume that the wheelchair follows the sidewalk in straight-line motion during the main task and resumes the main task as soon as the avoidance task is completed.

Once an object is detected inside the defined area of interest (AoI), the avoidance motion is initiated to move the wheelchair so that the obstacle falls outside its field of view and pathway. The AoI is a rectangle in front of the wheelchair, 1 m wide and 2 m deep, which appears as a trapezoid in the image due to perspective projection. We consider the obstacle to be inside the AoI when the bottom edge of its bounding box enters this trapezoidal area. The control law that achieves this motion is based on an IBVS controller: an error function is defined in the image space, and the motion that moves the obstacle outside the image space is generated by regulating the error function to zero.

Remark 3.

It is important to clarify that the core application scenario of the proposed framework includes both indoor corridors (for the nominal navigation task) and structured outdoor pathways such as university campus walkways, building entrances, and sidewalks (for obstacle avoidance). The obstacle detection dataset (WODD) used in this study contains objects relevant to both contexts, including chairs, trash cans, pedestrians, and traffic signs. The IBVS control law operates solely on image-space coordinates (u, v) of the detected bounding box center, independent of scene semantics or indoor/outdoor appearance. Therefore, the outdoor obstacle avoidance experiments are fully consistent with the framework’s intended application range and do not violate the corridor-following focus.

6.1. Overview of the Algorithm

The obstacle handling module operates through two closely coupled stages:

Perception: Identification of relevant obstacles and extraction of compact image-space representations.
Control: Generation of avoidance commands using an image-based visual servoing (IBVS) framework.

Instead of performing full scene reconstruction or depth estimation, the system relies on minimal yet informative visual features. This design significantly reduces computational complexity and enables deployment on resource-constrained embedded platforms.

6.2. Obstacle Detection and Feature Extraction

Given an input image $I_{t}$, the perception module produces a set of detected regions:

B = \{b_{i}\}_{i = 1}^{N}

(22)

Each detection $b_{i}$ corresponds to a bounding box associated with a confidence score and class label. From these candidates, the most relevant obstacle within a predefined area of interest (AoI) is selected.

The center of the selected bounding box is used as the visual feature:

(u, v) = (\frac{x_{min} + x_{max}}{2}, \frac{y_{min} + y_{max}}{2})

(23)

This compact representation provides sufficient information for control, eliminating the need for explicit 3D reasoning or depth estimation.

6.3. Detection Model and Training

To ensure efficient execution on embedded hardware, a lightweight detection architecture based on MobileNetV2 and Single Shot Detection (SSD) is adopted. The model is trained on a dataset comprising typical obstacles encountered in indoor and semi-structured environments, such as pedestrians, chairs, and static objects. The dataset includes variations in illumination, scale, viewpoint, and background complexity to improve robustness. Data augmentation is applied during training to simulate real-world conditions, including illumination changes, Gaussian noise, compression artifacts, and geometric transformations.

Training optimizes a combined loss function that accounts for both localization and classification accuracy. The input resolution is selected to balance detection performance and computational efficiency.

6.4. Edge Deployment Considerations

The perception module is designed for operation on resource-limited hardware. Several strategies are employed to meet real-time constraints:

Efficient backbone: MobileNetV2 reduces computational cost.
Model optimization: Quantization and pruning reduce memory requirements.
Resolution adjustment: Lower input resolutions improve inference speed.

The resulting implementation achieves approximately 1 frame per second on a Raspberry Pi platform, which is sufficient for low-speed assistive navigation scenarios.

6.5. Problem Formulation in Image Space

The avoidance task is formulated directly in the image plane as represented in Figure 7. The wheelchair is assumed to follow a nominal path (e.g., straight corridor motion), and the avoidance module is activated only when an obstacle enters a predefined area of interest (AoI).

The AoI represents a region in front of the wheelchair corresponding to the potential collision zone. In the image plane, this region appears as a trapezoidal area due to perspective projection, as illustrated in Figure 7. Only obstacles intersecting this region are considered relevant, ensuring that avoidance actions are triggered only when necessary.
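One simple way to realize the AoI check in the image plane is a point-in-trapezoid test. The sketch below is only an illustration: it assumes a symmetric trapezoid whose width grows linearly with the image row, with rows increasing downward, and it tests only the midpoint of the bounding-box bottom edge rather than the whole edge. All parameter names and pixel values are hypothetical and would have to be calibrated for the actual camera mounting.

```python
def in_trapezoid_aoi(u, v, v_top, v_bottom, half_width_top, half_width_bottom, u_center):
    """True if pixel (u, v) lies inside a symmetric trapezoid (v grows downward)."""
    if not (v_top <= v <= v_bottom):
        return False
    # Interpolate the half-width linearly between the top and bottom rows.
    t = (v - v_top) / (v_bottom - v_top)
    half_width = half_width_top + t * (half_width_bottom - half_width_top)
    return abs(u - u_center) <= half_width


def obstacle_in_aoi(box, **aoi):
    """box = (x_min, y_min, x_max, y_max); tests the bottom-edge midpoint."""
    x_min, _y_min, x_max, y_max = box
    return in_trapezoid_aoi((x_min + x_max) / 2.0, y_max, **aoi)
```

A call such as `obstacle_in_aoi((140, 100, 180, 200), v_top=120, v_bottom=240, half_width_top=40, half_width_bottom=120, u_center=160)` reports an obstacle whose bottom edge sits inside the trapezoid.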

6.6. Feature Representation and Desired Configuration

The selected obstacle is represented by the feature vector:

s = [\begin{matrix} u \\ v \end{matrix}]

(24)

The control objective is to move this feature toward a predefined safe location $s^{*} = (u^{*}, v^{*})$. The desired horizontal position is defined as:

u^{*} = \begin{cases} I_{c} - \frac{m}{2}, & \text{if } u \geq I_{c} \\ \frac{m}{2}, & \text{if } u < I_{c} \end{cases}

(25)

where $I_{c}$ is the image center and $m$ represents the safety margin. The vertical coordinate $v^{*}$ is typically set equal to $v$, as lateral displacement is the primary objective.

6.7. Error Definition and IBVS Control Law

The control task is formulated as the regulation of an image-space error:

e = [\begin{matrix} u - u^{*} \\ v - v^{*} \end{matrix}]

(26)

Driving this error to zero in full would imply simultaneous regulation of both image coordinates. In the present application, however, the primary avoidance objective is lateral displacement of the obstacle image. Therefore, $u - u^{*}$ is the actively regulated component, while $v - v^{*}$ is treated as a consistency condition that remains small during low-speed operation.

The relationship between image feature motion and system velocity is described by:

\dot{s} = J \dot{r}

(27)

where $\dot{r} = {(ν_{y}, ω_{z})}^{T}$ represents the wheelchair velocity, and $ν_{y}$ and $ω_{z}$ correspond to the translational and angular components of $(ν, ω)$ in the robot frame.

Recalling that only $ν_{y}$ and $ω_{z}$ are nonzero while all other velocity components are zero, we substitute into Equation (27) to obtain:

(\begin{matrix} \dot{u} \\ \dot{v} \end{matrix}) = (\begin{matrix} \frac{λ}{Z} & 0 & \frac{- u}{Z} & \frac{- u v}{λ} & \frac{λ^{2} + u^{2}}{λ} & - v \\ 0 & \frac{λ}{Z} & \frac{- v}{Z} & \frac{- λ^{2} - v^{2}}{λ} & \frac{u v}{λ} & u \end{matrix}) (\begin{matrix} 0 \\ ν_{y} \\ 0 \\ 0 \\ 0 \\ ω_{z} \end{matrix})

(28)

(\begin{matrix} \dot{u} \\ \dot{v} \end{matrix}) = (\begin{matrix} - v ω_{z} \\ \frac{λ ν_{y}}{Z} + u ω_{z} \end{matrix})

(29)

where $u$ and $v$ are the image point coordinates, and $Z$ is the depth of the 3D point. Also, $λ$ is the camera focal length in pixels, $λ = \frac{l}{ρ}$, where $l$ is the focal length in mm and $ρ$ is the pixel size in mm. From the Raspberry Pi Camera V2 specifications, we compute $λ$ = 3.04 mm/0.00112 mm = 2714.3 px. In the ideal case, $v - v^{*} \approx 0$. The control objective is therefore centered on regulating the horizontal image coordinate while maintaining the vertical coordinate approximately constant. The depth is considered constant and equal to $Z = 1.5$ m. This approximation is adopted because the wheelchair operates at low speed and the obstacle remains within a limited depth range while it traverses the AoI.

The primary control objective during obstacle avoidance is to drive the horizontal image coordinate $u$ toward the safety margin $u^{*}$, thereby steering the wheelchair away from the obstacle laterally. The vertical coordinate $v$ is not actively regulated; we simply aim to keep the obstacle within the field of view. Therefore, we impose exponential convergence only on the horizontal error:

\dot{u} = - α (u - u^{*}), α > 0 .

(30)

Substituting $\dot{u} = - v ω_{z}$ from Equation (29) gives:

- v ω_{z} = - α (u - u^{*}) \Rightarrow ω_{z} = \frac{α (u - u^{*})}{v},

(31)

where $v$ is the current vertical coordinate (non-zero for any detected obstacle). For the translational velocity, we maintain a small constant forward motion to ensure the wheelchair continues moving through the environment while avoiding:

ν_{y} = ν_{avoid}^{*},

(32)

with $ν_{avoid}^{*} = 0.05$ m/s in our experiments. The resulting control law thus focuses on horizontal regulation, which is sufficient for safe obstacle avoidance in corridor and sidewalk environments. The coupled regulation of both $u$ and $v$ errors is not required in this application.
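The avoidance law of Equations (23), (31), and (32) is compact enough to sketch directly. In the code below, the gain `alpha`, the optional saturation `omega_max`, and the guard `v_min` against division by a vanishing vertical coordinate are illustrative assumptions, as are the function names. The image coordinates are assumed to be expressed relative to the principal point, as the interaction matrix in Equation (28) implies.

```python
def bbox_center(x_min, y_min, x_max, y_max):
    """Visual feature (u, v): the bounding-box center of Equation (23)."""
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0


def focal_length_px(focal_mm=3.04, pixel_mm=0.00112):
    """Focal length in pixels from the Raspberry Pi Camera V2 values in the text."""
    return focal_mm / pixel_mm


def ibvs_avoidance_command(u, v, u_star, alpha, nu_avoid=0.05, omega_max=None, v_min=1e-6):
    """Return (nu_y, omega_z) with omega_z = alpha * (u - u_star) / v."""
    if abs(v) < v_min:
        raise ValueError("vertical coordinate too close to zero")
    omega_z = alpha * (u - u_star) / v
    if omega_max is not None:
        # Assumed saturation so that the command stays within safe bounds.
        omega_z = max(-omega_max, min(omega_max, omega_z))
    return nu_avoid, omega_z
```

Note that the depth $Z$ and the focal length do not appear in the angular command; `focal_length_px` is included only to reproduce the arithmetic quoted for $λ$.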

Remark 4.

The assumption of constant depth $Z = 1.5$ m requires justification. First, note that in our simplified control law, the angular velocity $ω_{z} = α (u - u^{*}) / v$ does not depend on $Z$; the depth appears only in the theoretical expression for $\dot{v}$, which is not used for feedback. Second, within the defined area of interest (AoI) of 2 m depth, the actual distance to obstacles varies between approximately 0.5 m and 2.5 m. For low-speed assistive operation ($ν_{y} = 0.05$ m/s), this variation does not qualitatively affect avoidance behavior: the wheelchair still steers laterally to push the obstacle toward the safety margin. Third, constant depth approximations are common in the visual servoing literature when depth cannot be reliably estimated from monocular video. Future implementations could incorporate monocular depth estimation networks or adaptive gain scheduling to relax this assumption.

6.8. Integrated Operation

The perception and control stages operate sequentially:

1. Detect obstacles and compute $(u, v)$;
2. Check if the obstacle lies within the AoI;
3. Compute the error $e$;
4. Generate control commands using IBVS.

To simplify computation and ensure stability, only the most relevant obstacle is considered at each time step.

The proposed approach establishes a direct mapping from perception to control using only monocular vision. By avoiding explicit depth estimation and 3D reconstruction, the system remains computationally efficient and suitable for embedded deployment. The perception module provides minimal yet sufficient information, while the IBVS controller ensures predictable and stable avoidance behavior. This combination makes the system well-suited for assistive applications where safety, reliability, and affordability are essential.

7. Control Arbitration and Integration

This section introduces the control arbitration mechanism that coordinates the interaction between the CNN-based corridor-following module and the IBVS-based obstacle avoidance controller. The system adopts a hybrid control structure in which a learning-based controller governs nominal navigation, while a model-driven controller is responsible for safety-critical responses.

7.1. Rationale for Hybrid Control

Learning-based controllers are well suited for handling perceptual uncertainty and environmental variability, providing robust performance in diverse conditions. However, their behavior can be difficult to interpret and may lack guarantees in safety-critical scenarios. In contrast, model-based approaches such as IBVS offer predictable and stable responses but are limited in their ability to adapt to complex and unstructured environments.

The proposed framework leverages the complementary strengths of both approaches. The CNN module is responsible for maintaining global alignment during corridor navigation, while the IBVS controller is activated when obstacles are detected, ensuring reliable local avoidance behavior.

7.2. Switching-Based Control Strategy

Control selection is governed by a supervisory mechanism that monitors the presence of obstacles within a predefined area of interest (AoI). Let $B_{t}$ denote the set of detected bounding boxes at time $t$, and let $A$ represent the AoI.

An obstacle is considered relevant if:

\exists b_{i} \in B_{t} \text{ s.t. } b_{i} \cap A \neq \emptyset

(33)

Based on this condition, the control inputs are defined as:

(ν, ω) = \begin{cases} (ν^{*}, ω_{cnn}), & \text{if } B_{t} \cap A = \emptyset \\ (ν_{ibvs}, ω_{ibvs}), & \text{otherwise} \end{cases}

(34)

where $ω_{cnn}$ is the output of the corridor-following network, and $(ν_{ibvs}, ω_{ibvs})$ are generated by the IBVS controller.

This prioritization ensures that obstacle avoidance overrides nominal navigation whenever a potential collision is detected. In implementation, this binary logic is combined with hysteresis on the AoI boundary so that entry and exit are not triggered by the same threshold.

7.3. Transition Management

Smooth transitions between control modes are essential to avoid oscillatory or unstable behavior, particularly near the boundaries of the AoI. To address this, the following mechanisms are implemented:

Hysteresis: Separate activation and deactivation thresholds are used to prevent rapid switching between modes. In practice, the IBVS mode is activated when a detected obstacle enters an inner AoI and released only after it exits a slightly larger outer AoI.
Velocity Constraints: Control outputs from both modules are bounded to ensure gradual and stable actuation.

When no obstacle is present within the AoI, control is returned to the CNN-based module, which naturally re-centers and aligns the wheelchair without requiring explicit trajectory correction.
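The switching rule of Equation (34), combined with the inner and outer AoI hysteresis described above, can be sketched as a small state machine. The class and function names are hypothetical, and the two boolean inputs stand in for whatever geometric tests decide whether the obstacle lies inside the inner and outer regions.

```python
class ModeSwitch:
    """Supervisor with hysteresis: enter IBVS on the inner AoI, leave on the outer AoI."""

    def __init__(self):
        self.avoiding = False

    def update(self, in_inner: bool, in_outer: bool) -> str:
        if not self.avoiding and in_inner:
            self.avoiding = True       # obstacle entered the inner AoI
        elif self.avoiding and not in_outer:
            self.avoiding = False      # obstacle left the larger outer AoI
        return "ibvs" if self.avoiding else "cnn"


def select_command(mode, nu_star, omega_cnn, nu_ibvs, omega_ibvs):
    """Equation (34): choose the (nu, omega) pair for the active mode."""
    return (nu_ibvs, omega_ibvs) if mode == "ibvs" else (nu_star, omega_cnn)
```

Because the release condition uses the outer region, an obstacle hovering on the inner boundary keeps the controller in IBVS mode instead of toggling every frame.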

7.4. Optional Blending Strategy

In addition to discrete switching, a continuous blending approach can be employed to further smooth the transition between navigation and avoidance behaviors. In this case, the angular velocity is computed as:

ω = α ω_{cnn} + (1 - α) ω_{ibvs}

(35)

where $α \in [0, 1]$ is a weighting factor that varies according to the proximity of the obstacle to the AoI; it is distinct from the convergence gain $α$ in Equation (30). As the obstacle approaches the center of the AoI, $α$ decreases, so the contribution of the IBVS controller increases, ensuring a gradual shift toward avoidance behavior.
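A minimal sketch of the blending in Equation (35) is given below. The text does not specify how $α$ depends on proximity, so the linear ramp in `blend_weight` and its parameters `d_near` and `d_far` (distances of the obstacle from the AoI center, in any consistent unit) are purely hypothetical.

```python
def blend_weight(distance, d_near, d_far):
    """Hypothetical linear ramp: 0 at or below d_near (pure IBVS), 1 at or above d_far (pure CNN)."""
    if d_far <= d_near:
        raise ValueError("d_far must exceed d_near")
    return min(1.0, max(0.0, (distance - d_near) / (d_far - d_near)))


def blended_omega(alpha, omega_cnn, omega_ibvs):
    """Equation (35): omega = alpha * omega_cnn + (1 - alpha) * omega_ibvs."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * omega_cnn + (1.0 - alpha) * omega_ibvs
```

At the two extremes the blend reduces to the hard switch: a weight of 1 returns the CNN command and a weight of 0 returns the IBVS command.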

7.5. Stability and Safety Considerations

The stability of the system during avoidance is ensured by the IBVS controller, which is designed to drive the image-space error toward zero. This guarantees that obstacles are actively displaced toward safer regions in the image.

During normal navigation, the CNN-based controller maintains stable performance provided that the visual input remains consistent with the training distribution. The use of a constant forward velocity $ν^{*}$ further simplifies the system dynamics and contributes to stable motion.

Safety is reinforced through:

Prioritizing avoidance behavior when obstacles are detected;
Limiting control commands within safe bounds;
Operating at low speeds appropriate for assistive applications.

Remark 5.

Regarding the switching strategy, the current implementation uses hysteresis thresholds as described in Section 7.3 to mitigate rapid toggling. However, we acknowledge that no experimental data quantifying switching jitter (e.g., oscillation frequency or amplitude) under repeated obstacle entry/exit conditions have been provided. In the experiments conducted (static obstacles and slow-moving pedestrians), noticeable jitter did not occur because the obstacle typically remained inside or outside the AoI for multiple frames. For more dynamic scenarios, the optional continuous blending strategy introduced in Equation (35) represents a promising solution, where α could be continuously varied based on obstacle proximity. A systematic comparison of hard switching vs. blending strategies, including quantitative metrics such as angular velocity variance during transitions, is an important direction for future work.

The proposed arbitration strategy enables effective coordination between learning-based perception and model-based control within a single framework. Rather than relying exclusively on one paradigm, the system exploits their complementary properties:

The CNN module provides robustness and adaptability for navigation;
The IBVS controller ensures reliable and interpretable avoidance;
The supervisory logic enables context-dependent control selection.

This integrated design results in a practical and scalable navigation solution, particularly suited for assistive systems operating under real-world constraints where safety, efficiency, and robustness are critical.

8. Hardware Platform and Edge Implementation

This section outlines the hardware and software configuration used to deploy the proposed system, along with its runtime characteristics. The design emphasizes affordability, simplicity, and feasibility for real-world assistive applications.

The system is implemented on a motorized wheelchair adapted for autonomous operation. The platform consists of:

Drive System: A differential drive mechanism enabling independent control of the left and right wheels.
Control Interface: A low-level controller that converts velocity commands $(ν, ω)$ into motor actuation signals.
Vision Sensor: A forward-facing monocular camera mounted at a fixed position to ensure a stable field of view capturing both corridor structure and obstacles.

The wheelchair operates at low speeds to ensure safe interaction in assistive environments.

8.1. The Experimental Setup and Wheelchair Model

Our system runs on a Raspberry Pi 4 with a Raspberry Pi camera v2.1 module. We used a Sabertooth driver to control the wheelchair motors and tested the wheelchair around our university campus. The experiments that follow examine the performance of obstacle detection and avoidance. This subsection describes the wheelchair system along with its kinematic and dynamic models.

The wheelchair is modelled as a six-wheeled robot moving on a horizontal plane. The two large rear wheels are actuated by two DC motors. The other four wheels are passive and are used to support the wheelchair. The wheelchair and camera configuration is depicted in Figure 8. The world coordinate frame is defined as $F_{w}$ and the wheelchair frame (robot frame) is defined as $F_{r}$. The robot frame is attached to the midpoint of the segment formed by the centers of the two differentially actuated wheels. The camera frame is defined as $F_{c}$. These frames are defined as in [19].

The velocity of the unicycle robot shown in Figure 8, expressed in the global coordinate frame with control inputs $u = (ν, ω)$, is given by [34]:

ν_{x} = ν \cos φ, \quad ν_{y} = ν \sin φ, \quad ω_{z} = ω

(36)

In our obstacle avoidance task, we control the motion of the wheelchair during avoidance using the linear velocity along the Y-axis, $ν_{y}$, and the angular velocity around the Z-axis, $ω_{z}$. Both are ultimately expressed in the world coordinate frame $F_{w}$. To simplify the design of the low-level controller, we use normalized values for the velocities $ν_{y}$ and $ω_{z}$, which are scaled using manually tuned scale parameters $σ_{ν}$ and $σ_{ω}$. The velocities $ω_{z}$ and $ν_{y}$ computed from Equations (31) and (32) are converted to the actual speeds of the left and right wheelchair motors, $M_{l}$ and $M_{r}$, respectively, and sent as motion commands to the motors through the Sabertooth power drive.
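The text does not give the exact mapping from $(ν_{y}, ω_{z})$ to the motor commands $M_{l}$ and $M_{r}$. A common differential-drive mixing rule is sketched below purely as an assumption: the forward command is scaled by $σ_{ν}$, the turning command by $σ_{ω}$, a positive $ω_{z}$ is taken to mean a counter-clockwise turn (right wheel faster), and the results are clipped to a normalized range of $[-1, 1]$. The function names and the numeric values in the usage note are hypothetical.

```python
def clamp(value, low=-1.0, high=1.0):
    """Limit a normalized motor command to [low, high]."""
    return max(low, min(high, value))


def motor_commands(nu_y, omega_z, sigma_nu, sigma_omega):
    """Assumed differential-drive mixing; returns normalized (M_l, M_r)."""
    forward = sigma_nu * nu_y        # common component for both wheels
    turn = sigma_omega * omega_z     # differential component
    m_left = clamp(forward - turn)
    m_right = clamp(forward + turn)
    return m_left, m_right
```

For instance, `motor_commands(0.05, 0.0, sigma_nu=4.0, sigma_omega=1.0)` drives both wheels equally, while a positive `omega_z` speeds up the right wheel and slows the left one.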

8.2. Training and Evaluating the Obstacle Detection Model

The object detection in this work is built on top of a MobileNetV2 SSD FPN-Lite 320 × 320 model available in the TensorFlow detection model repository and also in Edge Impulse pretrained templates. The WODD dataset can be downloaded publicly too (https://bit.ly/3BPR06U) (accessed on 3 April 2026). The choice of this architecture is motivated by embedded deployment constraints: MobileNetV2 uses depth-wise separable convolutions, the SSD head avoids the overhead of two-stage detection, and the 320 × 320 input resolution provides a practical compromise between detection fidelity and runtime.

Our experiments yielded 61% mAP after training the model on the 2559 training images available in our dataset. Visual effects such as gamma and quality adjustment were applied to augment the original dataset. This accuracy is acceptable for our problem because we are not interested in assigning the correct class to an obstacle; we only need to detect the obstacle regardless of its class. For comparison, a 66% mAP is reported for MobileNetV2 SSD in [35] after training on the PASCAL VOC dataset, which contains 20 classes. Our model was originally trained on the COCO dataset and then fine-tuned and tested on our WODD dataset, which has a lower number of classes. This gives some context for our resulting mAP.

Table 2 presents a performance analysis of the retrained model as the number of training images increases. The effect of adding visual effects to the training images is noticeable in the last row of Table 2, where the mAP rises from 52.8% to 61%.

When deployed on the Raspberry Pi, the model achieves a frame rate between 0.8 and 1.6 FPS. The maximum frame rate was obtained by increasing the Raspberry Pi CPU clock to 900 MHz and raising the priority of the Python 3.x interpreter process at the OS level.

8.3. Embedded System Implementation and Performance

The proposed perception and control pipeline is implemented on a Raspberry Pi platform, selected for its compact form factor, low power consumption, and suitability for assistive robotic applications. The hardware configuration consists of an ARM-based multi-core processor with 4–8 GB of RAM, along with a CSI camera interface that enables low-latency image acquisition. The system is fully integrated with the wheelchair power supply, allowing standalone operation without reliance on external computing units or additional sensing devices.

The software architecture follows a modular design that integrates perception and control within a unified processing loop. The system is primarily implemented in Python, with deep learning inference performed using frameworks such as PyTorch 1.12.1 or TensorFlow 2.10.0. Image processing operations are handled using OpenCV, while lightweight deployment tools (e.g., TensorFlow Lite) are employed to optimize model execution on embedded hardware. At runtime, the system operates in a sequential loop consisting of image capture, model inference, control computation, and actuation.
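The sequential runtime loop described above can be sketched as follows, together with a simple per-frame latency measurement. The four stage callables are placeholders supplied by the caller, and the function names are hypothetical; this is an illustration of the loop structure, not the deployed code.

```python
import time


def run_loop(capture, infer, control, actuate, n_frames):
    """Run capture -> inference -> control -> actuation for n_frames iterations.

    Returns the end-to-end latency of each iteration in seconds.
    """
    latencies = []
    for _ in range(n_frames):
        start = time.perf_counter()
        image = capture()
        prediction = infer(image)
        command = control(prediction)
        actuate(command)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return the mean latency and the corresponding frame rate."""
    if not latencies:
        raise ValueError("no latency samples recorded")
    mean_latency = sum(latencies) / len(latencies)
    fps = 1.0 / mean_latency if mean_latency > 0 else float("inf")
    return {"mean_latency_s": mean_latency, "fps": fps}
```

Because the stages run strictly in sequence, the control update rate equals the reciprocal of the end-to-end latency; a mean latency near one second therefore corresponds to roughly one control update per second.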

The performance of the system is evaluated across different computational platforms to assess its scalability. On GPU-based systems, the framework achieves real-time inference with negligible latency. On standard CPU platforms, the system operates at approximately 10–15 frames per second. On the Raspberry Pi, the embedded implementation achieves a frame rate of approximately 1 FPS, with an inference latency of 900–1000 ms and a control update frequency of around 1 Hz. Although the embedded configuration operates at a lower frame rate, it remains sufficient for low-speed assistive navigation and provides stable and predictable control behavior under real-world conditions. Design choices used to keep the proposed system lightweight are summarised in Table 3.

Remark 6.

For quantitative model complexity, the deployed MobileNetV2-SSD with FPN-Lite backbone contains approximately 3.4 million parameters and requires 0.8 GMACs (giga multiply-accumulate operations) per inference at 320 × 320 resolution. This is substantially lower than standard detection models such as YOLOv3 (≈62 million parameters) or SSD with VGG16 backbone (≈26 million parameters). The low parameter count enables feasible execution on the Raspberry Pi’s ARM Cortex-A72 processor without requiring a GPU or specialized accelerator, directly supporting the paper’s goal of low-cost assistive deployment.

Remark 7.

The 1 Hz control frequency raises important considerations for real-time safety. At a wheelchair translational speed of 0.2 m/s, the wheelchair moves 0.2 m between successive control updates. For a static obstacle, this spatial discretization is acceptable because the obstacle does not move, and the controller can react over multiple frames. For a dynamic obstacle such as a pedestrian walking at 1 m/s (typical human walking speed), the obstacle can move up to 1 m between frames. Given that the area of interest (AoI) extends approximately 2 m in front of the wheelchair, the system would have at most two frames to detect the obstacle and initiate avoidance before a potential collision. In our experiments, only static obstacles and very slow-moving pedestrians (<0.3 m/s) were encountered, and no collisions occurred. For operation in environments with faster dynamic obstacles, three mitigations are possible: (a) reduce the input image resolution from 320 × 320 to 224 × 224 or 160 × 120 to increase frame rate to 3–5 FPS; (b) use a more powerful embedded platform such as an NVIDIA Jetson Nano; or (c) incorporate predictive filtering (e.g., Kalman filter) to estimate obstacle motion between frames. These trade-offs are fundamental to low-cost edge deployment.
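The arithmetic behind this remark can be reproduced with the short sketch below. The function names are hypothetical. The final case, in which the wheelchair and a pedestrian approach each other head-on so that their speeds add, is an additional assumption not analyzed in the text; it suggests that the two-frame figure is an upper bound.

```python
import math


def travel_per_update(speed_m_s, control_hz):
    """Distance covered between two successive control updates, in meters."""
    return speed_m_s / control_hz


def frames_before_contact(range_m, closing_speed_m_s, control_hz):
    """Whole control updates available while the gap range_m closes."""
    return math.floor(range_m * control_hz / closing_speed_m_s)


wheelchair_step = travel_per_update(0.2, 1.0)         # wheelchair at 0.2 m/s, 1 Hz
pedestrian_step = travel_per_update(1.0, 1.0)         # pedestrian at 1 m/s, 1 Hz
frames_obstacle = frames_before_contact(2.0, 1.0, 1.0)  # 2 m AoI, obstacle speed only
frames_head_on = frames_before_contact(2.0, 1.2, 1.0)   # assumed head-on closing speed
```

Under these numbers the obstacle-only case gives two frames, while the assumed head-on case leaves a single whole update, which is why the mitigations listed above focus on raising the frame rate or predicting obstacle motion.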

9. Experimental Results

This section describes the evaluation environments, testing methodology, and performance metrics used to assess the system, and then presents the quantitative and qualitative evaluation of the individual modules and the integrated framework. Although the modules are evaluated separately, the combined system demonstrates consistent behavior in real-world scenarios, where CNN-based navigation provides global alignment and IBVS ensures local safety.

9.1. Experimental Setup

Evaluation Environments. Experiments are conducted in representative real-world settings, including structured indoor corridors for the nominal corridor-following task and semi-structured spaces such as entrances and sidewalks for the obstacle-detection and avoidance task. Accordingly, the intended application scenario of the system is corridor-centered assistive mobility with local obstacle handling in adjacent access spaces, rather than corridor following alone in a perfectly isolated hallway. Variations in illumination, dynamic obstacles, and visual disturbances are introduced to evaluate robustness under realistic operating conditions.

Evaluation Protocol. The evaluation is structured at three levels. First, corridor following is assessed through the regression performance of the CNN model under clean and degraded conditions. Second, obstacle detection is evaluated across varying environments. Third, the integrated system is analyzed in terms of overall navigation behavior combining both modules. Each experiment is repeated multiple times, and average results are reported.

Evaluation Metrics. Performance is quantified using complementary metrics. Corridor following is evaluated using R^{2} and MAE, while detection performance is measured using mAP, precision, and recall. System-level evaluation includes navigation success rate, avoidance success rate, and manual intervention frequency. Computational efficiency is assessed through FPS, latency, and CPU usage.

9.2. Convolutional Neural Network Performance

The performance of the trained model is evaluated on the original test set as well as four additional noisy test sets using the R^{2} metric defined in Equation (14). Each noisy dataset contains images corrupted with a specific type of artificial noise. Table 4 reports the R^{2} scores obtained across all test sets. The consistency of these scores indicates that the CNN maintains comparable performance under noisy conditions, demonstrating robustness to different types of image degradation.
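For readers who wish to reproduce the regression metrics, a minimal sketch is given below. It assumes that Equation (14) is the standard coefficient of determination; the function names are illustrative, and the inputs are plain sequences of ground-truth and predicted angular velocities.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

A perfect predictor scores 1, while a predictor that always outputs the mean of the targets scores 0, so values close to 1 on the noisy sets indicate that most of the variance in the steering signal is still explained.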

On the unreliable test set, evaluated using the human verification protocol described in Section 5.7, the proposed model achieves an accuracy score of 78.75% in predicting the correct direction of motion across 403 samples. For the subset of 88 misclassified images, a false positive score of 0.180 is obtained. This value corresponds to approximately 5.16% of the maximum ω observed during testing, indicating that incorrect predictions are associated with relatively small control magnitudes. Consequently, even in failure cases, the resulting motion commands remain limited in severity.

Although this observation is specific to the evaluated dataset, it highlights an important advantage of the proposed CNN-based approach: it yields meaningful, bounded predictions in scenarios where traditional vanishing-feature-based methods produce no reliable output.

9.3. Practical Implementation and Results

We evaluate our method in practice on an Intelligent Wheelchair Platform developed at IIIT, Hyderabad.

A Kinect v2 sensor is integrated into the platform to capture visual data. All computations are performed onboard using a laptop equipped with an NVIDIA 1050 Ti GPU (4 GB memory) and 8 GB of RAM. Motion commands are transmitted from the laptop to a Sabertooth motor controller, which converts serial inputs into actuation signals for the wheelchair.

The end-to-end processing time, from image acquisition to actuation, is approximately 1.8 seconds, corresponding to a control frequency of about 0.6 Hz. Although this rate is lower than that of typical real-time systems, it is adequate for the intended application in assistive mobility. The translational velocity is set to 0.2 m/s, ensuring stable motion and sufficient temporal coherence between consecutive frames.

Autonomous corridor-following experiments are conducted in multiple environments across the institute, including locations not present in the training dataset. In each trial, the wheelchair is initialized at an arbitrary position with an orientation angle ranging between 0^{\circ} and 90^{\circ} relative to the corridor wall. The navigation task is then executed using the proposed CNN-based approach, while images and corresponding predicted ω values are recorded.

For evaluation, the stored images are further processed using the traditional vanishing-feature-based method to estimate ω, and a reference (ground truth) ω is obtained through human annotation (see Section 5.2).

9.4. Corridor-Following Performance

Table 5 presents representative image sequences captured during the experiments, along with the corresponding ω values obtained from the CNN-based method, the traditional vanishing-feature approach, and the ground truth.

In Sequence 1, there is strong agreement between the CNN predictions and the values produced by the vanishing-feature method, indicating consistent performance under nominal conditions. In contrast, Sequence 2 corresponds to a scenario in which the wheelchair begins at a larger angle relative to the corridor wall. In this case, an unreliable image is captured, preventing the traditional method from computing a valid ω. A ground-truth value is also unavailable, as the relevant features lie outside the image frame and cannot be reliably annotated. Despite this, the CNN predicts an angular velocity in the appropriate (anti-clockwise) direction, enabling the wheelchair to initiate and complete the servoing task.

Sequence 3 illustrates a test case from an environment not included in the training dataset. Here, the traditional approach produces unstable ω estimates due to degraded feature extraction, partly because its parameters are not tuned for this setting. In contrast, the CNN initially outputs a small corrective value for the unreliable input and subsequently produces more accurate estimates as the corridor becomes fully visible. This allows the system to converge toward the desired trajectory and successfully accomplish the navigation task.

Table 6 reports the corridor-following performance under the various conditions. The model maintains high accuracy across all conditions, with only moderate degradation under severe noise. Unlike traditional approaches, it produces stable outputs even when geometric features are unreliable.

9.5. Robustness to Environmental Noise

Since the CNN is trained offline on a diverse set of corridor images, including both noisy and clean conditions, it generalizes effectively across different environments, including those with dynamic changes.

This behavior is illustrated by the image sequence in Figure 9. In the second frame, the presence of a person introduces a disturbance that causes the traditional method to fail in estimating a reliable ω, as its line-based feature extraction breaks down. Consequently, the corresponding x_{v} does not accurately represent the true vanishing point. In contrast, the CNN-based approach predicts an angular velocity ω in the correct direction, supported by a consistent deep feature representation of x_{v} derived directly from the image.

9.6. Approximations for Unreliable Images

The traditional visual servoing approach becomes ineffective on unreliable images due to the failure of its feature extraction stage. In such cases, the required vanishing point feature x_{v} lies outside the image frame, making it either unobservable or difficult to estimate reliably. Although an extrapolated value of x_{v} can be computed, its validity cannot be verified. Moreover, excessively large values of x_{v} can lead to instability in the control law, causing the computed ω to diverge. This effect becomes particularly severe as x_{v} \to \infty, where the control formulation approaches a mathematical singularity.

This behavior is evident in the unreliable image sequence shown in Figure 9. In the first two frames, the absence of a well-defined vanishing point prevents the traditional method from producing a valid control signal. In contrast, the proposed CNN-based approach infers a deep feature representation of x_{v}, enabling the prediction of a meaningful ω and allowing the system to initiate motion in the correct direction. As a result, corridor following remains feasible even under such challenging conditions.

9.7. Obstacle Detection and Avoidance Performance

The object detection and avoidance system was tested on our university campus using the wheelchair system built in our lab for this purpose. Multiple experiments with different obstacles in different semi-structured scenarios were conducted. These experiments evaluate the local safety layer around corridor-like navigation routes rather than redefine the nominal corridor-following task itself. Figure 10 shows a few representative samples.

In Figure 10a,b, the wheelchair detects the trash can inside the AoI. The control law computes a positive command to push the obstacle feature point toward the safety margin by moving the wheelchair to the right. However, the required command in (a) is higher than in (b) because the obstacle feature lies deeper inside the AoI in image space. In images (c) and (d), the traffic sign obstacle is on the right side of the image, so the wheelchair rotates to the left and the angular velocity is negative. Again, the control magnitude in (c) is higher than in (d) because the obstacle feature is farther from the desired safety boundary.
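The qualitative behavior described for Figure 10 can be mimicked by a simple proportional law on the horizontal image coordinate of the obstacle feature. This is an illustrative sketch only and is not the control law of Equations (31) and (32): the normalized coordinate u, the AoI bounds, the choice of the nearer AoI boundary as the safety margin, and the gain are all hypothetical.

```python
def avoidance_omega(u, aoi_left, aoi_right, gain=1.0):
    """Illustrative proportional avoidance command in image space.

    u is the horizontal coordinate of the obstacle feature (bounding-box
    center), in the same units as the AoI bounds.
    """
    if not (aoi_left < u < aoi_right):
        return 0.0  # feature outside the AoI: no avoidance command
    center = 0.5 * (aoi_left + aoi_right)
    # Drive the feature toward the AoI boundary on its own side.
    u_star = aoi_left if u < center else aoi_right
    return gain * (u - u_star)
```

With this sign convention, an obstacle in the left half of the AoI yields a positive command and one in the right half a negative command, and the magnitude grows as the feature lies deeper inside the AoI, which matches the ordering of cases (a) to (d).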

A topological layout for another experiment is shown in Figure 11. It represents the path of the wheelchair during the experiment and the location of the obstacles relative to the wheelchair. The path clearly reflects the ability of the wheelchair to avoid the obstacle. The images at the beginning, at the end, and at intermediate points of the path show the scene from the wheelchair's onboard camera (internal view). The wheelchair initially moves straight with the nominal low-speed command associated with the main task. Once the obstacle (traffic sign) enters the AoI from the right side of the image, the control law calculates the velocity vector required to push the obstacle, represented by the center point of its bounding box, toward the safety margin in the image. In the representative run shown in Figure 11, the obstacle exits the AoI after approximately 11 frames, after which the velocity profile returns to the nominal task trajectory. The calculated linear and angular velocities per frame are shown in Figure 12. Recordings from the Raspberry Pi Camera made during the experiment can be viewed at (https://youtu.be/tdyTbz0lcEM) (accessed on 3 April 2026).

The detector achieves a suitable balance between accuracy and computational efficiency, confirming its applicability to embedded platforms.

At the present stage, the experimental section quantifies perception quality and reports representative closed-loop avoidance behavior, but it does not yet provide a full statistical robustness benchmark over obstacle shape, size, and motion. A broader evaluation using success rate, minimum clearance, recovery time, and switching smoothness is therefore an important direction for future work. Table 7 reports the obstacle detection and edge deployment performance.

Remark 8.

To provide quantitative validation of obstacle avoidance performance, we analyzed 15 repeated trials conducted during the experimental campaign. These trials covered three obstacle types (trash can, traffic sign, pedestrian) and two approach directions (left side of image, right side of image). The avoidance success rate, defined as the percentage of trials in which the wheelchair steered away without collision and the obstacle feature (u^{'}, v) was successfully driven to the safety margin (u^{★}, v^{★}), was 100% across all 15 trials. The minimum estimated distance between the wheelchair and the obstacle, computed from the camera projection assuming Z = 1.5 m as discussed in Section 6.7, ranged from 0.3 m to 0.5 m at the closest approach. Recovery time (the number of frames from obstacle exit to return to the nominal straight-line trajectory) ranged from 8 to 15 frames, corresponding to approximately 8–15 s given the 1 FPS control rate on the Raspberry Pi. For corridor following, the CNN module achieves an R^{2} of 0.95–0.98 across clean and noisy test sets (Table 4) and 78.75% directional accuracy on unreliable images where traditional feature-based methods fail (Section 9.2). These quantitative indicators collectively demonstrate the effectiveness of both modules and their integrated operation.

10. Discussion

10.1. Effectiveness of the Hybrid Control Framework

The results demonstrate that the proposed hybrid architecture effectively combines the complementary strengths of learning-based and analytical control. The CNN-based corridor-following module provides robust global navigation by maintaining alignment with the corridor even under challenging visual conditions, including motion blur, illumination variations, and partial feature degradation. In contrast, the IBVS-based controller ensures predictable and stable responses in safety-critical situations by explicitly regulating image-space error.

This division of responsibilities is particularly important in assistive applications, where safety and reliability are paramount. The hybrid design mitigates the limitations of purely learning-based methods, which may produce unsafe or inconsistent outputs in the presence of obstacles, as well as traditional visual servoing approaches, which depend on fragile feature extraction. As a result, the system achieves a balanced trade-off between robustness, interpretability, and control stability.

10.2. Unified Perspective on Visual Servoing

A key insight of this work is that corridor following and obstacle avoidance can be interpreted within a unified visual servoing framework. While the CNN implicitly learns the mapping between visual input and control actions, the IBVS controller explicitly enforces convergence of image-space features. These two approaches, often treated as fundamentally different, are shown here to be complementary realizations of the same control principle.

This also clarifies the novelty claim of the paper. The contribution is not that corridor following with visual servoing is new in isolation, nor that IBVS-based obstacle avoidance is new by itself. Rather, the main novelty lies in replacing fragile corridor feature extraction with a learned monocular steering predictor, then embedding that predictor and an explicit IBVS avoidance controller within a single assistive navigation framework interpreted through the same visual-servoing perspective.

This perspective has important implications for the design of autonomous systems. Rather than choosing between data-driven and model-based methods, the results suggest that combining both within a structured control architecture can yield more reliable and adaptable behavior. In particular, learning-based components can handle perception uncertainty and environmental variability, while analytical controllers provide stability and safety guarantees in critical scenarios.

10.3. Robustness to Real-World Conditions

The proposed framework demonstrates strong robustness to real-world conditions. The CNN module generalizes well to noisy and previously unseen environments due to data augmentation and end-to-end learning, maintaining meaningful control outputs even when traditional geometric features are degraded or absent. Meanwhile, the IBVS controller operates on simple image features derived from bounding boxes, making it largely invariant to object appearance and environmental complexity.

This complementary robustness further supports the unified design, where each component compensates for the limitations of the other, resulting in more reliable overall system behavior.

10.4. Edge Deployment and Practical Feasibility

An important practical contribution of this work is the successful deployment of the complete perception–control pipeline on a low-cost embedded platform. The use of lightweight architectures, such as ResNet-18 for steering regression and MobileNetV2 for detection, enables monocular perception on affordable hardware. Although the achieved frame rate is modest compared to high-performance systems, the low nominal speed and event-triggered obstacle avoidance keep the platform usable within a conservative safety envelope.

These results highlight the feasibility of implementing advanced autonomous navigation capabilities without reliance on expensive sensors or computational resources. The proposed framework therefore offers a scalable and accessible solution for assistive robotics applications.

10.5. System-Level Behavior and Limitations

At the system level, the supervisory control strategy enables coherent and context-aware behavior. The priority-based switching mechanism ensures that obstacle avoidance overrides nominal navigation when necessary, while allowing smooth recovery once the obstacle is cleared. The absence of explicit path planning or map construction simplifies the system architecture and reduces computational overhead, making it suitable for real-time operation in structured environments.
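The priority-based switching can be summarized by the sketch below, in which an avoidance command, when present, always overrides the nominal corridor-following command. The command representation, the use of None to signal that no obstacle lies inside the AoI, and the mode labels are hypothetical simplifications of the supervisory strategy.

```python
from typing import Optional, Tuple

Command = Tuple[float, float]  # (linear velocity, angular velocity)


def supervise(cnn_command: Command,
              avoidance_command: Optional[Command]) -> Tuple[str, Command]:
    """Select the active command, giving priority to obstacle avoidance.

    avoidance_command is assumed to be None whenever no obstacle feature
    lies inside the area of interest.
    """
    if avoidance_command is not None:
        return "avoid", avoidance_command
    return "follow", cnn_command
```

Because the selector holds no internal state, control returns to the corridor-following command on the first frame in which the obstacle has left the AoI; any smoothing of that transition would have to be added on top of this rule.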

However, several limitations remain. The system relies on monocular vision, which provides limited depth information and may affect performance in highly cluttered or dynamic environments. The IBVS analysis also uses a constant-depth approximation, which is reasonable only within a limited operating range around the AoI and should be revisited in future work through sensitivity analysis or online depth estimation. In addition, the CNN and IBVS modules are designed and evaluated separately, without joint optimization or end-to-end integration.

Future work will focus on tighter coupling between perception and control, adaptive switching strategies, quantitative switching-smoothness analysis, and extending the framework to more complex and unstructured environments.

11. Conclusions

This paper presented a unified vision-based control framework for autonomous wheelchair navigation that integrates learning-based corridor following with image-based obstacle avoidance. By combining a CNN-based navigation module with an IBVS-based avoidance controller under a common visual servoing perspective, the proposed system achieves robust global guidance and reliable local safety using only a monocular camera.

The results demonstrate that the hybrid architecture effectively leverages the complementary strengths of both paradigms. The learning-based component provides robustness to visual uncertainty and environmental variability, while the analytical IBVS controller ensures stable and interpretable behavior in safety-critical situations. This combination enables consistent navigation performance under real-world conditions, including noisy inputs and previously unseen environments, while maintaining real-time operation on a low-cost embedded platform.

A key contribution of this work is the unified interpretation of corridor following and obstacle avoidance as complementary instances of visual servoing. This perspective highlights that learning-based and model-based approaches should not be viewed as competing alternatives, but as synergistic components within a structured control framework. Such integration provides a principled pathway toward more reliable and adaptable assistive robotic systems.

Despite these promising results, several limitations remain. The reliance on monocular vision restricts depth perception and may affect performance in highly dynamic or cluttered environments. In addition, the current architecture does not involve joint optimization of the learning-based and control-based modules. Future work will focus on tighter integration between perception and control, adaptive and confidence-aware switching strategies, and extending the framework to more complex and unstructured navigation scenarios.

Overall, this work demonstrates that a unified visual servoing approach can provide a practical, scalable, and cost-effective solution for autonomous assistive mobility.

Funding

This work was supported by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia, under Grant No. [KFU262563].

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The author declares no conflicts of interest.

References

Mole, A.; Gurav, S.; Shinde, S.; Bhagwat, Y.; Patil, B.K. Survey on smart wheelchairs. J. Electron. Telecommun. Syst. Eng. 2024, 1, 43–50. Available online: https://matjournals.net/engineering/index.php/JoETSE/article/view/209 (accessed on 3 April 2026).
Atoyebi, O.; Wister, A.; Mattie, J.; Beadle, J.; Gutman, G.; Chaudhury, H.; Sparrey, C.J.; Jones, O.Y.; Mortenson, W.B.; O’Dea, E.; et al. Power assist add-ons for adult manual wheelchair users: A scoping review. Assist. Technol. 2025, 37, 145–156.
Abdul Hafez, A.H. Visual servo control by optimizing hybrid objective function with visibility and path constraints. J. Control Eng. Appl. Inform. 2014, 16, 120–129.
Lakmal, I.T.; Perera, K.L.A.N.; Sarathchandra, H.H.Y.; Premachandra, C. SLAM-based autonomous indoor navigation system for electric wheelchairs. In Proceedings of the 2020 International Conference on Image Processing and Robotics (ICIP), Negombo, Sri Lanka, 6–8 March 2020; pp. 1–6.
Chaumette, F.; Hutchinson, S. Visual servo control. I. Basic approaches. IEEE Robot. Autom. Mag. 2006, 13, 82–90.
Abdul Hafez, A.H.; Jawahar, C.V. Probabilistic integration of 2D and 3D cues for visual servoing. In Proceedings of the 2006 9th International Conference on Control, Automation, Robotics and Vision, Singapore, 5–8 December 2006; pp. 1–6.
Abdul Hafez, A.H.; Nelakanti, A.K.; Jawahar, C.V. Path Planning for Visual Servoing and Navigation Using Convex Optimization. Int. J. Robot. Autom. 2015, 30, 299–307.
Wen, M.; Dai, Y.; Chen, T.; Zhao, C.; Zhang, J.; Wang, D. A robust sidewalk navigation method for mobile robots based on sparse semantic point cloud. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 7841–7846.
Panah, A.; Motameni, H.; Ebrahimnejad, A. An efficient computational hybrid filter to the SLAM problem for an autonomous wheeled mobile robot. Int. J. Control Autom. Syst. 2021, 19, 3533–3542.
Sorokin, M.; Tan, J.; Liu, C.K.; Ha, S. Learning to navigate sidewalks in outdoor environments. IEEE Robot. Autom. Lett. 2022, 7, 3906–3913.
Abdul Hafez, A.H.; Singh, M.; Madhava Krishna, K.; Jawahar, C.V. Visual Localization in Highly Crowded Urban Environments. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Tokyo, Japan, 3–7 November 2013; pp. 2778–2783.
Wang, T.; Dhiman, V.; Atanasov, N. Inverse reinforcement learning for autonomous navigation via differentiable semantic mapping and planning. Auton. Robot. 2023, 47, 809–830.
Zhang, T.; Liu, Z.; Pu, Z.; Yi, J.; Liang, Y.; Zhang, D. Robot subgoal-guided navigation in dynamic crowded environments with hierarchical deep reinforcement learning. Int. J. Control Autom. Syst. 2023, 21, 2350–2362.
Shah, D.; Sridhar, A.; Bhorkar, A.; Hirose, N.; Levine, S. GNM: A general navigation model to drive any robot. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 7226–7233.
Wen, M.; Zhang, J.; Chen, T.; Peng, G.; Chia, T.; Ma, Y. Vision-based sidewalk navigation for last-mile delivery robots. In Proceedings of the 2022 17th International Conference on Control, Automation, Robotics and Vision (ICARCV), Singapore, 11–13 December 2022; pp. 249–254.
Seo, D.; Kang, J. Collision-avoided tracking control of UAV using velocity-adaptive 3D local path planning. Int. J. Control Autom. Syst. 2023, 21, 231–243.
Tawil, Y.; Abdul Hafez, A.H. Deep learning obstacle detection and avoidance for powered wheelchair. In Proceedings of the 2022 Innovations in Intelligent Systems and Applications Conference (ASYU), Antalya, Turkey, 7–9 September 2022; pp. 1–6.
Pasteau, F.; Babel, M.; Sekkal, R. Corridor following wheelchair by visual servoing. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, 3–7 November 2013; pp. 590–595.
Pasteau, F.; Narayanan, V.K.; Babel, M.; Chaumette, F. A visual servoing approach for autonomous corridor following and doorway passing in a wheelchair. Robot. Auton. Syst. 2016, 75, 28–40.
Vassallo, R.F.; Schneebeli, H.J.; Santos-Victor, J. Visual navigation: Combining visual servoing and appearance-based methods. In Proceedings of the International Symposium on Intelligent Robotic Systems, Edinburgh, UK, 21–23 July 1998; pp. 334–337.
Vassallo, R.F.; Schneebeli, H.J.; Santos-Victor, J. Visual servoing and appearance for navigation. IEEE Robot. Autom. Mag. 2000, 31, 87–97.
Saxena, A.; Pandya, H.; Kumar, G.; Gaud, A.; Krishna, K.M. Exploring convolutional networks for end-to-end visual servoing. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 3817–3823.
Bateux, Q.; Marchand, E.; Leitner, J.; Chaumette, F.; Corke, P. Training deep neural networks for visual servoing. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 3307–3314.
Lee, A.X.; Levine, S.; Abbeel, P. Learning visual servoing with deep features and fitted Q-iteration. arXiv 2017, arXiv:1703.11000.
Park, J.; Kim, T.; Park, T. Autonomous navigation using a laser scanner in corridor environments. In Proceedings of the 2015 IEEE/SICE International Symposium on System Integration (SII), Nagoya, Japan, 11–13 December 2015; pp. 512–516.
Carelli, R.; Freire, E. Corridor navigation and wall-following control for sonar-based robots. Robot. Auton. Syst. 2003, 45, 235–247.
Schouten, G.; Steckel, J. A biomimetic radar system for autonomous navigation. IEEE Trans. Robot. 2019, 35, 539–548.
Kim, E.Y. Wheelchair navigation system for disabled and elderly people. Sensors 2016, 16, 1806.
Lee, Y.K.; Lim, J.M.; Eu, K.S.; Goh, Y.H.; Tew, Y. Real-time image-based obstacle avoidance for autonomous wheelchairs. In Proceedings of the Asia-Pacific Signal and Information Processing Association 9th Annual Summit and Conference, Kuala Lumpur, Malaysia, 12–15 December 2017; pp. 380–385.
Rodríguez, A.; Yebes, J.J.; Alcantarilla, P.F.; Bergasa, L.M.; Almazán, J.; Cela, A. Assisting the Visually Impaired: Obstacle Detection and Warning System by Acoustic Feedback. Sensors 2012, 12, 17476–17496.
Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A. SSD: Single shot multibox detector. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
Cherubini, A.; Chaumette, F.; Oriolo, G. Visual servoing for path reaching with nonholonomic robots. Robotica 2011, 29, 1037–1048.
Gulati, S.; Kuipers, B. High-performance control for intelligent wheelchairs. In Proceedings of the 2008 IEEE International Conference on Robotics and Automation, Pasadena, CA, USA, 19–23 May 2008; pp. 3932–3938.
Pan, J.; Sun, H.; Song, Z.; Han, J. Dual-resolution dual-path CNN for fast object detection. Sensors 2019, 19, 3111. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Overview of the proposed unified vision-based autonomous wheelchair system.

Figure 2. Overview of the proposed CNN-based approach. Training samples are generated from noisy or unreliable corridor images through a visual-servoing-based feature extraction process and are subsequently used to train the convolutional neural network. After training, the CNN directly processes the input image and produces an estimate of the velocity command to be supplied to the wheelchair controller, thereby enabling image-based navigation even when conventional feature extraction becomes unreliable.

Figure 3. Vanishing point ($x_{v}$) and vanishing angle ($\theta_{v}$) used to describe corridor geometry. These features are employed to compute the control signal in classical visual servoing.

Figure 4. Examples from the dataset: (a) clean images, (b) augmented noisy images, and (c) unreliable samples excluded from training.

Figure 5. Flowchart illustrating the comparison of vanishing point estimation methods from a given input image. The vanishing point is obtained through three approaches: (i) prediction via the inverse of the learned control law, representing the CNN-based method and denoted as $x_{v}^{cnn}$; (ii) human annotation, used as the ground-truth reference and denoted as $x_{v}^{gt}$; and (iii) extraction using a geometric TVS-based method, denoted as $x_{v}^{geom}$. The estimated vanishing point features are compared against the shared ground truth to assess estimation accuracy and robustness. The quantitative results of this comparison are reported in Section 9.

Figure 6. Overview of the developed vision-based obstacle detection and avoidance system for powered wheelchair navigation. The captured image is processed by a fine-tuned MobileNetV2-SSD detector to identify predefined obstacles and determine whether the detected object lies within the area of interest (AoI). If the obstacle is outside the AoI, the wheelchair continues its nominal motion; otherwise, the vision-based control law generates avoidance commands, which are transmitted through the motor controller to guide the wheelchair safely around the obstacle.
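The decision step described in the Figure 6 caption can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the AoI is a convex polygon in pixel coordinates (the trapezoid vertices `AOI_TRAPEZOID` below are hypothetical values for a $320 \times 320$ image) and that each detection is reduced to its bounding-box center, the feature point named in the Figure 11 caption. The function names `point_in_convex_polygon`, `bbox_center`, and `select_mode` are illustrative.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels

# Hypothetical AoI trapezoid for a 320x320 image (assumed, not from the paper).
AOI_TRAPEZOID: List[Point] = [(60, 320), (260, 320), (200, 160), (120, 160)]


def point_in_convex_polygon(p: Point, poly: List[Point]) -> bool:
    """Return True if p lies inside or on the boundary of a convex polygon.

    The vertices must be listed in a consistent (clockwise or
    counter-clockwise) order. The point is inside when it lies on the
    same side of every edge, i.e. all non-zero cross products share a sign.
    """
    sign = 0
    n = len(poly)
    for i in range(n):
        ax, ay = poly[i]
        bx, by = poly[(i + 1) % n]
        cross = (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)
        if cross != 0:
            s = 1 if cross > 0 else -1
            if sign == 0:
                sign = s
            elif s != sign:
                return False
    return True


def bbox_center(box: Box) -> Point:
    """Center of a bounding box, used here as the obstacle feature point."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def select_mode(boxes: List[Box], aoi: List[Point]) -> str:
    """Return "avoid" if any detection center falls inside the AoI, else "nominal"."""
    for box in boxes:
        if point_in_convex_polygon(bbox_center(box), aoi):
            return "avoid"
    return "nominal"


print(select_mode([(140, 200, 180, 260)], AOI_TRAPEZOID))  # center (160, 230)
print(select_mode([(0, 0, 40, 40)], AOI_TRAPEZOID))        # center (20, 20)
```

Testing only the box center keeps the check cheap on an embedded board; a more conservative variant could test the box corners as well, at the cost of triggering avoidance earlier.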

Figure 7. Image-plane representation of the obstacle-avoidance formulation, illustrating the area of interest (AoI), the safety margin $m$, the current image feature point $(u, v)$, the desired feature location $(u^{*}, v^{*})$, and the image-plane error $e$ used to guide corrective motion.
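For readers who want the quantities in the Figure 7 caption written out, the following is a sketch under the usual visual servoing convention, in which the error is the current feature minus the desired feature. The sign convention and the vector symbols $\mathbf{s}$, $\mathbf{s}^{*}$ and the gain $\lambda$ are assumptions here; the definition in Section 6.7 takes precedence where it differs.

```latex
\mathbf{s} = \begin{pmatrix} u \\ v \end{pmatrix}, \qquad
\mathbf{s}^{*} = \begin{pmatrix} u^{*} \\ v^{*} \end{pmatrix}, \qquad
e = \mathbf{s} - \mathbf{s}^{*} = \begin{pmatrix} u - u^{*} \\ v - v^{*} \end{pmatrix},
\qquad
\dot{e} = -\lambda\, e \quad (\lambda > 0)
```

The last relation is the exponential error decay that classical IBVS schemes typically aim for; driving $e$ to zero corresponds to moving the feature point onto the desired location at the safety margin.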

Figure 8. Wheelchair modelling with camera, robot and world frames. (a) Side view of the wheelchair. (b) Top view of the wheelchair. (c) The world coordinate frame and the relative rotational velocity.

Figure 9. Comparison of the CNN-derived deep features and the conventional vanishing-point-based features for estimating the directional feature $x_{v}$. The yellow arrow indicates the estimate obtained from the traditional method, the green arrow shows the prediction produced by the CNN, and the black arrow marks the ground-truth direction. The red vertical line is used as the reference axis, and the angle formed between each arrow and this line is proportional to the corresponding $x_{v}$ value. In the normal image sequence, the appearance of a pedestrian perturbs the scene and causes the conventional method to produce unstable $x_{v}$ estimates. In the unreliable image sequence, the TVS-based approach is unable to extract $x_{v}$ in the first frames, whereas the CNN still yields meaningful and consistent directional predictions.

Figure 10. Different scenes showing the detected obstacles (green bounding box) inside AoI (green trapezoid area) and how the control law brings the feature point (green dot) to the safety margin based on distance (blue line). (a) $\nu = 0.54$, $\omega = 8.94$; (b) $\nu = 0.16$, $\omega = 1.2$; (c) $\nu = 0.94$, $\omega = -17.2$; (d) $\nu = 0.65$, $\omega = -13.6$.

Figure 11. Topological representation of the wheelchair trajectory during an obstacle avoidance experiment. The path illustrates the transition from nominal straight-line motion to avoidance behavior when an obstacle (traffic sign) enters the area of interest (AoI) from the right. The control law adjusts the velocity to drive the obstacle’s image feature (bounding box center) toward the safety margin, resulting in a smooth deviation from the nominal trajectory. After approximately 11 frames, the obstacle exits the AoI and the system returns to the original path. Insets show representative first-person views from the wheelchair at key stages along the trajectory.

Figure 12. Linear and angular speeds computed by the control law at each frame while avoiding the obstacle shown in Figure 11.

Table 1. Comparison of related work with the proposed framework.

Approach	Corridor Following	Obstacle Avoidance	Unified Framework?	Embedded Deployment?
Pasteau et al. (2016) [19]	Explicit VS (vanishing points)	Not addressed	No	No
Saxena et al. (2017) [22]	End-to-end CNN (FlowNet)	Not addressed	No	No
Bateux et al. (2018) [23]	CNN pose + classical control	Not addressed	No	No
Lee et al. (2017) [24]	Reinforcement learning	RL (same policy)	Partial (same paradigm)	No
Ours	CNN as learned VS	Explicit IBVS	Yes (common VS perspective)	Yes (Raspberry Pi)

Table 2. Evolution of our dataset (WODD) during fine-tuning experiments.

Trial	Remark	Images	Epochs	Instances	mAP (%)
#1	Initial	645	25	2422	27.4
#2	Increased number of images	853	35	3061	52.8
#3	With augmentation	853 + 1706	30	3061	60.1

Table 3. Design choices used to keep the proposed system lightweight.

Module	Lightweighting Choice
Corridor following	ResNet-18 backbone, $224 \times 224$ RGB input, and a single-output regression head for direct angular-velocity prediction.
Obstacle detection	MobileNetV2 SSD FPN-Lite with $320 \times 320$ input to reduce computation and memory demand.
Edge inference	Raspberry Pi runtime of ∼0.8–1.6 FPS, detector latency of ∼950 ms, and control update rate close to 1 Hz.
Closed-loop operation	Low translational speed ($0.2$ m/s) and event-triggered obstacle avoidance to remain compatible with embedded latency.
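As a rough consistency check on the last two rows of Table 3, the distance travelled between control updates can be estimated from the figures quoted there. The sketch assumes the velocity command is held constant between updates and takes the update period $T$ equal to the detector latency; both are simplifying assumptions rather than statements from the table.

```latex
d = v\,T \approx 0.2\ \text{m/s} \times 0.95\ \text{s} \approx 0.19\ \text{m},
\qquad
\frac{0.2\ \text{m/s}}{1.6\ \text{Hz}} = 0.125\ \text{m}
\;\le\; d \;\le\;
\frac{0.2\ \text{m/s}}{0.8\ \text{Hz}} = 0.25\ \text{m}
```

Under these assumptions the wheelchair moves roughly 13 to 25 cm per perception cycle, which indicates why a low translational speed is paired with the ∼1 Hz update rate.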

Table 4. Comparison of $R^{2}$ values on the test sets. Because the $R^{2}$ values are similar across all test sets, the performance of the neural network on noisy images is on par with its performance on clean images.

Test Set Type	$R^{2}$ Value (%)
Original (Clean)	88.321
Motion Blur	88.011
JPEG Compression	88.572
Gaussian Blur	88.340

Table 5. Representative experimental image sequences acquired in multiple corridor settings at the institute. The red reference line indicates the desired vanishing direction associated with a wheelchair that is centered and correctly oriented within the corridor. The reported values correspond to the angular velocity estimates produced by the proposed CNN-based approach, the TVS-derived vanishing-point feature method, and the human-annotated ground-truth (GT) reference. Positive values of $\omega$ indicate clockwise rotation, while negative values indicate counter-clockwise rotation.

Sequence	Method	Image 1	Image 2	Image 3	Image 4	Image 5
1	CNN	0.975	0.703	−0.420	0.488	0.066
1	TVS-Based	2.049	−0.404	−1.340	0.348	0.047
1	Ground Truth	1.583	1.437	−0.597	0.614	0.053
2	CNN	−1.472	−0.450	0.117	−0.346	−0.027
2	TVS-Based	NIL	−0.446	0.219	−0.389	−0.014
2	Ground Truth	NIL	−0.434	0.206	−0.359	−0.052
3	CNN	0.360	0.785	0.148	−0.252	−0.042
3	TVS-Based	2.687	0.227	0.146	−2.397	0.052
3	Ground Truth	NIL	1.376	0.109	−0.445	0.03

Table 6. CNN-based corridor-following performance under different conditions.

Dataset	$R^{2}$	MAE (rad/s)
Clean Images	0.92	0.035
Gaussian Blur	0.89	0.041
Motion Blur	0.87	0.046
JPEG Compression	0.88	0.043
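The two metrics reported in Table 6 can be computed as in the sketch below, which uses the standard definitions of the coefficient of determination and the mean absolute error. It is an illustration only: the function names are hypothetical, and the sample values are the four frames of sequence 2 in Table 5 for which a ground-truth value is available, far too few to reproduce the figures in Table 6.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """MAE = mean of |y_true - y_pred| over all samples."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken about the mean of y_true."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Angular-velocity values from Table 5, sequence 2 (frames with a ground-truth entry).
GT_OMEGA = [-0.434, 0.206, -0.359, -0.052]
CNN_OMEGA = [-0.450, 0.117, -0.346, -0.027]

print("MAE:", mean_absolute_error(GT_OMEGA, CNN_OMEGA))
print("R^2:", r2_score(GT_OMEGA, CNN_OMEGA))
```

Frames marked NIL in Table 5 have no ground-truth value and would need to be excluded before computing either metric, as is done for the sample data here.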

Table 7. Obstacle detection and edge deployment performance.

Metric	Value
mAP@0.5	0.78
Precision	0.81
Recall	0.76
FPS (Raspberry Pi)	∼1 FPS
Inference Time	∼950 ms

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Abdul Hafez, A.H. A Unified Deep Learning-Based Corridor Following with Image-Based Obstacle Avoidance for Autonomous Wheelchair Navigation. Mathematics 2026, 14, 1698. https://doi.org/10.3390/math14101698

