1. Introduction
Fusion of information captured by different types of sensors has found use in many engineering applications, such as traffic control, security, autonomous vehicles, and mobile robotics [1,2]. In computer vision, several studies have focused on combining the information captured by RGB and depth cameras to enhance the robustness of object detection and scene segmentation [3,4]. However, even with the low cost and excellent quality of RGB and depth cameras, the performance of fusion-based computer vision algorithms did not improve significantly until the introduction of fast processors (such as GPUs) and deep learning algorithms (such as deep neural networks (DNNs)) [5]. Several techniques have been proposed to fuse RGB and depth images in the deep learning framework [6]. These techniques can be classified into three categories [7]:
Early fusion: This type of fusion operates directly at the image pixel level (raw images or transformed images). Some examples of this approach are:
Four-channel input: Stacking the three RGB channels and the depth channel yields a 4-channel image, which serves as input to the DNN classifier [8]. In this case, the DNN architecture requires some changes to accommodate the 4-channel input. It is not clear, however, to what extent the resulting image incorporates the depth information.
Pixel summing/averaging: In this approach, the pixels of the three RGB channels (R, G, and B) are added to the corresponding pixels of the depth channel to form a new image [9]. This modified image is then fed to the DNN classifier. A variant of this approach concatenates the RGB and depth images into a larger image that serves as input to the DNN classifier [10].
Color space modification: The RGB channels convey overlapping visual information due to their inherent statistical dependence, which motivates the exploration of alternative color representations for multimodal fusion. In early-stage fusion frameworks, depth information can be incorporated by assigning it to a specific component of a transformed color model. This allows geometric cues to be integrated without increasing the input dimensionality. Such fusion strategies have been successfully applied in RGB–D perception systems to enhance indoor scene understanding and semantic segmentation performance [11]. In recent years, RGB–D fusion methods have continued to evolve, with growing interest in attention-based and transformer-driven architectures that better integrate appearance and depth information [11,12]. These developments emphasize the importance of structured multimodal learning for improving robustness in complex environments. Moreover, this fusion strategy can, in principle, be extended to alternative color models such as HSV, YIQ, or CMY, in which a selected channel may be repurposed to encode depth information. In this work, the CMYK color space is adopted, as its K (black) channel explicitly represents global intensity, making it particularly suitable for embedding normalized depth information in a semantically coherent manner [13].
Intermediate fusion: This type of fusion occurs at the feature level. The most widely used structure is the Siamese network [11,12,13,14,15]. In this structure, features are independently extracted from the RGB and depth images using parallel DNNs, then concatenated and fed into a single classifier. Although this type of classification network has a relatively large number of parameters, parallel processing of the two sources can improve its computational performance.
Late fusion: In this approach, the RGB and depth images are first classified independently using separate DNNs. The outputs (decisions) of the classifiers are then combined to obtain more reliable decisions. Various late decision fusion methods exist in the literature [16,17]; they include majority voting (for deterministic decisions) and Bayesian fusion (for probabilistic decisions).
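As a minimal illustration of these two late-fusion rules (the function names are ours, not taken from [16,17]), majority voting combines hard class labels, while Bayesian fusion, shown here in its simplest form, multiplies independent per-class posteriors and renormalizes:

```python
from collections import Counter

def majority_vote(decisions):
    """Combine deterministic class decisions from several
    classifiers by majority vote."""
    return Counter(decisions).most_common(1)[0][0]

def bayes_fuse(p_rgb, p_depth):
    """Combine per-class posteriors from two classifiers assumed
    independent: elementwise product, then renormalization."""
    prod = [a * b for a, b in zip(p_rgb, p_depth)]
    total = sum(prod)
    return [p / total for p in prod]
```

For example, three classifiers voting [2, 2, 3] yield class 2, and fusing the posteriors [0.6, 0.4] and [0.5, 0.5] yields a distribution that still favors the first class while summing to one.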
Compared to the existing types of image fusion, the color space modification approach has several advantages. It is simple, can use existing RGB-based DNN architectures without modification, and benefits from the transfer learning property of DNNs. It can also be applied to different color spaces, such as RGB, HSV, and CMY.
Although RGB–depth information fusion through color-space modification in DNN frameworks has been used in several computer vision applications, only a few researchers have applied this strategy to mobile robot obstacle avoidance (decision making and control).
The proposed approaches use the fused information for semantic segmentation and path generation [18,19,20]. Tai et al. [21] proposed a model-free obstacle avoidance algorithm that takes raw depth images as input and generates control commands as output. This algorithm combines obstacle detection and control command generation into a single convolutional neural network (CNN or ConvNet). Our proposed obstacle avoidance method extends Tai's work: instead of using only depth images as input, we use a fused image representation derived from RGB images and depth information through a CMYK–CMYD transformation.
In this article, we propose a CNN-based classification method for obstacle avoidance using a new early-stage image fusion technique. This approach turned out to be more efficient than those using RGB and depth images separately. To fuse the two types of images, we chose the color space modification early fusion approach because of the advantages mentioned above. First, the RGB images are converted into the CMYK color space. Then, the K (black) channel is replaced with a normalized version of the depth image to obtain a CMYD representation. Finally, the CMYD images are used as input to the CNN. The CNN architecture and hyperparameters were selected to achieve high classification performance.
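The CMYD construction can be sketched per pixel as follows. The conversion uses the standard RGB-to-CMYK relations; the function names and the depth normalization constant are illustrative, not taken from our implementation:

```python
def rgb_to_cmyk(r, g, b):
    """Standard per-pixel RGB -> CMYK conversion (inputs in 0..255,
    outputs in 0..1)."""
    r, g, b = r / 255.0, g / 255.0, b / 255.0
    k = 1.0 - max(r, g, b)
    if k >= 1.0:                       # pure black: chroma undefined
        return 0.0, 0.0, 0.0, 1.0
    c = (1.0 - r - k) / (1.0 - k)
    m = (1.0 - g - k) / (1.0 - k)
    y = (1.0 - b - k) / (1.0 - k)
    return c, m, y, k

def rgb_to_cmyd(r, g, b, depth, depth_max):
    """CMYD fusion: keep the C, M, Y chroma channels and replace
    the K channel with the normalized depth value."""
    c, m, y, _ = rgb_to_cmyk(r, g, b)
    return c, m, y, depth / depth_max
```

Applying `rgb_to_cmyd` to every pixel of a registered RGB–depth pair yields the four-channel CMYD image fed to the CNN; chromatic information survives in C, M, Y, while the fourth channel now carries geometry instead of intensity.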
The classifier was trained using data acquired locally by image sensors and an inertial measurement unit (IMU) attached to a TurtleBot mobile robot [22]. The data was collected in an indoor environment. The synchronized RGB and depth images were obtained from Kinect cameras (Microsoft, Redmond, WA, USA) and the orientations (angles) from the IMU. The developed CNN was then used to classify the images into one of five classes characterized by the IMU angle information.
In addition to improving classification performance, this work provides a clear analysis of where the proposed novelty lies, both at the representation and architectural levels. First, a novel RGB–depth fusion strategy is introduced through a color-space-driven early fusion mechanism, in which normalized depth information is embedded into the intensity-related K channel of the CMYK color space, resulting in the proposed CMYD representation. Unlike conventional RGB–D stacking or feature-level fusion approaches, this design enables a semantically consistent integration of chromatic appearance and geometric structure at the input level. Second, this study demonstrates that architectural refinement plays a crucial role in effectively exploiting depth information. While depth-only inputs typically underperform compared to RGB images when processed by a conventional CNN, the proposed refined CNN architecture, which integrates multi-scale convolutional analysis and attention-based feature refinement, significantly reduces this performance gap. As a result, depth-based classification becomes comparable to RGB-based performance, and the benefits of the CMYD fusion are further amplified. This dual contribution clarifies that the observed performance gains stem not only from improved data fusion, but also from a network architecture specifically designed to enhance the representation and utilization of geometric information for mobile robot obstacle avoidance.
The remainder of this paper is organized as follows.
Section 2 provides a brief overview of convolutional neural networks and their main building blocks.
Section 3 describes the proposed CNN-based classification framework, including the refined network architecture.
Section 4 details the proposed RGB–Depth fusion strategy based on the CMYK and CMYD representations, presents the experimental setup, and discusses the obtained results. Finally,
Section 5 concludes the paper and outlines potential directions for future work.
3. Development of the CNN Architecture
The convolutional neural network presented in this study is derived from a conventional two-dimensional CNN architecture through targeted structural modifications, rather than being proposed as an entirely new model. The original configuration follows a standard vision-based processing pipeline, where two-dimensional images are progressively transformed through convolutional and pooling stages before being mapped to navigation-related outputs via fully connected layers. This conventional setup is adopted as a reference point to allow the effects of the proposed architectural changes to be examined in an explicit and controlled manner.

Within the reference model, the expressive capacity of the network is primarily expanded by increasing the number of convolution–pooling layers. Each convolutional layer operates at a fixed spatial scale and performs similar feature extraction operations. As a result, deeper configurations tend to amplify already learned visual patterns instead of introducing qualitatively different representations. For perception-driven navigation tasks, such behavior may be insufficient, as mobile robots must simultaneously reason about local obstacle details and the global structure of the surrounding environment.

The proposed architecture departs from this depth-centric design philosophy by redefining the role of individual convolutional layers. Rather than increasing network depth, each layer is engineered to capture visual information at multiple spatial resolutions within a single processing step. This is realized through parallel convolutional operations employing kernels of varying sizes. Smaller receptive fields emphasize fine-grained visual cues such as object boundaries, while larger receptive fields encode coarse spatial information related to traversable regions and scene layout. The parallel responses are subsequently combined, enabling the network to integrate localized and contextual information at an early stage of processing.
Through this reformulation, network depth is no longer interpreted as a mere count of sequential layers. Instead, depth is associated with the diversity and richness of representations formed within each layer. This interpretation is particularly suitable for vision-based mobile robot navigation, where reliable decision-making depends on the joint availability of precise local perception and broader environmental awareness. By embedding multi-scale reasoning directly into each convolutional layer, the architecture achieves richer representations without relying on excessive layer stacking.
To further enhance the usefulness of the extracted features, an attention-based refinement mechanism is applied after the multi-scale fusion stage. This module dynamically adjusts the relative contribution of different feature channels, promoting responses that are critical for obstacle avoidance and navigation while attenuating less relevant activations. Such adaptive weighting improves robustness when operating in visually complex or cluttered environments.
The spatial down-sampling strategy is also revised accordingly. In contrast to conventional designs where pooling follows immediately after convolution, pooling in the proposed architecture is deferred until after multi-scale extraction and attention-based refinement. This delay ensures that informative spatial patterns related to obstacle geometry and free-space configuration are adequately processed before any reduction in spatial resolution occurs.
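The processing order described above, parallel multi-scale convolution, channel-wise attention, and only then pooling, can be illustrated with a deliberately simplified single-channel sketch. The attention rule here (weighting each feature map by its normalized total activation) is a stand-in for the learned attention module, not the mechanism used in the actual network, and all names are illustrative:

```python
def conv2d(img, kernel):
    """Naive 'same' 2-D convolution (zero padding) on a
    single-channel image given as a list of rows."""
    h, w = len(img), len(img[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for u in range(k):
                for v in range(k):
                    y, x = i + u - pad, j + v - pad
                    if 0 <= y < h and 0 <= x < w:
                        s += img[y][x] * kernel[u][v]
            out[i][j] = s
    return out

def multi_scale_block(img, kernels):
    """Parallel convolutions at several kernel sizes, a simple
    channel-weighting step, then 2x2 max pooling; pooling is
    deferred until after the weighting, as in the refined design."""
    # 1) multi-scale extraction: one feature map per kernel size
    maps = [conv2d(img, k) for k in kernels]
    # 2) simplified attention: weight each map by its share of
    #    the total activation (a learned module in the real network)
    totals = [sum(map(sum, m)) for m in maps]
    norm = sum(totals) or 1.0
    refined = [[[wt / norm * v for v in row] for row in m]
               for m, wt in zip(maps, totals)]
    # 3) delayed 2x2 max pooling on each refined map
    pooled = []
    for m in refined:
        h, w = len(m) // 2, len(m[0]) // 2
        pooled.append([[max(m[2*i][2*j], m[2*i][2*j+1],
                            m[2*i+1][2*j], m[2*i+1][2*j+1])
                        for j in range(w)] for i in range(h)])
    return pooled
```

The essential point is structural: the spatial resolution of the feature maps is reduced only after the multi-scale responses have been combined and reweighted, so no fine-grained obstacle geometry is discarded before the attention step has acted on it.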
In summary, the proposed CNN retains the same two-dimensional processing framework, input structure, and learning objective as the baseline model while substantially modifying the internal information flow within each convolutional layer. The architectural emphasis is shifted from increasing the number of layers to enhancing intra-layer representational capability, resulting in a perception model that is better aligned with the functional requirements of mobile robot navigation. Finally, the architecture is intentionally designed to remain independent of any specific visual modality at this stage. This allows the evaluation to focus on the inherent learning capacity of the network itself, independent of the characteristics of the input representation. Consequently, the same architecture can be consistently assessed using RGB images, depth data, or fused inputs, enabling an unbiased comparison of sensory representations in subsequent experiments [24]. The architectural differences between the baseline and refined CNN models are summarized in Table 1.
Training Strategy and Hyperparameter Selection
Once the architectural design is established, an appropriate training strategy is defined to ensure stable optimization and reliable generalization of the refined CNN architecture.
The objective of the training phase is to identify a suitable set of hyperparameters, noting that no universal rule or closed-form solution exists for optimal hyperparameter selection, as performance strongly depends on the task characteristics and data distribution. Accordingly, a series of controlled experiments was conducted to evaluate the influence of different training configurations on classification performance. Network optimization was performed using stochastic gradient descent with a momentum coefficient of 0.9 and a weight decay of 0.0005, which are commonly employed to accelerate convergence while mitigating overfitting. Key hyperparameters, including the learning rate, mini-batch size, and number of training epochs, were systematically varied to assess their effects on convergence behavior and testing accuracy. The experimental results indicated that a learning rate of 0.001 provides a stable optimization trajectory, enabling efficient learning without oscillatory behavior or premature convergence. With respect to mini-batch size, moderate batch sizes were found to be more effective. In particular, a mini-batch size of 32 yielded the most consistent and accurate performance across validation runs. From an optimization standpoint, such batch sizes offer a favorable balance between gradient stability and stochasticity, which often contributes to improved generalization compared to larger batch sizes that may lead to overly smooth updates.
Regarding the number of training epochs, the network was trained until convergence, defined by stabilization of the training loss and the absence of further improvement in validation accuracy. This strategy avoids underfitting due to insufficient training as well as overfitting caused by excessive iterations. Overall, the adopted training strategy and hyperparameter configuration were selected based on empirical evaluation rather than heuristic assumptions, ensuring that the reported performance reflects the intrinsic capabilities of the refined CNN architecture.
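The adopted hyperparameters and stopping rule can be summarized in a short sketch. The configuration values are those reported above; the patience-based convergence test is one common way to operationalize "no further improvement in validation accuracy" and is illustrative rather than the exact criterion used:

```python
# Hyperparameters reported in the text (SGD with momentum)
CONFIG = {"learning_rate": 0.001, "momentum": 0.9,
          "weight_decay": 0.0005, "batch_size": 32}

def stopping_epoch(val_accuracies, patience=5):
    """Given per-epoch validation accuracies, return the epoch at
    which training stops: the first epoch after `patience` epochs
    with no new best accuracy, or the last epoch otherwise."""
    best, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_accuracies) - 1
```

This kind of criterion halts training once the validation curve plateaus, which is what guards against both underfitting (stopping too early) and overfitting (iterating long past convergence).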
5. Conclusions
This paper presented a vision-based classification framework for mobile robot obstacle avoidance that jointly exploits appearance (RGB) and geometric (depth) information. Unlike conventional RGB–D fusion approaches that rely on simple channel stacking or late decision fusion, we introduced a novel early-stage fusion strategy based on color-space modification. Specifically, RGB images are first converted into the CMYK color space, after which the K (black) channel is replaced by a normalized depth map, resulting in a four-channel CMYD representation. Embedding depth information into an intensity-related channel enables semantically consistent fusion while preserving chromatic cues in the remaining channels.
Extensive experiments were conducted using a locally collected dataset acquired with a TurtleBot platform equipped with a Kinect sensor and an IMU, where navigation commands were categorized into five motion classes. The results demonstrate that the proposed CMYD fusion consistently outperforms RGB-only, depth-only, and CMYK-based inputs. Using the baseline CNN architecture, the CMYD representation achieved an overall classification accuracy of 93.3%, compared to 92.9% for RGB and 86.5% for depth-only inputs. When evaluated with the refined CNN architecture, performance improved across all input modalities, with the proposed CMYD representation reaching a maximum accuracy of 96.2%. This confirms that the performance gain originates from the fusion strategy itself and is further enhanced by architectural refinement.
The originality of this work lies in two complementary contributions. First, a novel RGB–depth fusion mechanism is introduced by embedding normalized depth information into the intensity-related K channel of the CMYK color space, yielding the CMYD representation, which provides a more principled integration of appearance and geometric cues than conventional RGB–D stacking. Second, a refined CNN architecture incorporating multi-scale convolutional analysis and attention-based feature refinement is proposed, enabling more effective extraction of navigation-relevant features and reducing confusion between neighboring motion classes. The observed improvements are therefore not the result of parameter tuning alone, but stem from a task-oriented fusion design combined with enhanced intra-layer feature learning. Overall, the experimental findings demonstrate that representation-level fusion (CMYD) and architecture-level optimization (refined CNN) play complementary roles in achieving robust and accurate obstacle avoidance. The proposed framework establishes an effective and scalable solution for RGB–D perception in mobile robotics. Future work will investigate extending the CMYD fusion strategy to other color spaces, evaluating performance in more complex and dynamic environments, and integrating more advanced attention mechanisms to further improve adaptability and generalization.