1. Introduction
Fusion of information captured by different types of sensors has found use in many engineering applications, such as traffic control, security, autonomous vehicles, and mobile robotics [1,2]. In computer vision, several studies have focused on combining the information captured by RGB and depth cameras to enhance the robustness of object detection and scene segmentation [3,4]. However, even with the low cost and excellent quality of RGB and depth cameras, the performance of fusion-based computer vision algorithms did not improve significantly until the introduction of fast processors (such as GPUs) and deep learning algorithms (such as deep neural networks (DNNs)) [5]. Several techniques have been proposed to fuse RGB and depth images in the deep learning framework [6]. These techniques can be classified into three categories [7]:
Early fusion: This type of fusion operates directly at the image pixel level (raw images or transformed images). Some examples of this approach are:
Four-channel input: Stacking the three RGB channels and the depth channel yields a 4-channel image, which serves as input to the DNN classifier [8]. In this case, the DNN architecture requires some changes to accommodate the 4-channel input. It is not clear, however, to what extent the resulting image incorporates the depth information.
Pixel summing/averaging: In this approach, the pixels of the three RGB channels (R, G, and B) are added to the corresponding pixels of the depth channel to form a new image [9]. This modified image is then fed to the DNN classifier. A variant of this approach concatenates the RGB and depth images into a larger image that serves as input to the DNN classifier [10].
Color space modification: The RGB channels convey overlapping visual information due to their inherent statistical dependence, which motivates the exploration of alternative color representations for multimodal fusion. In early-stage fusion frameworks, depth information can be incorporated by assigning it to a specific component of a transformed color model. This allows geometric cues to be integrated without increasing the input dimensionality. Such fusion strategies have been successfully applied in RGB–D perception systems to enhance indoor scene understanding and semantic segmentation performance [11]. In recent years, RGB–D fusion methods have continued to evolve, with growing interest in attention-based and transformer-driven architectures that better integrate appearance and depth information [11,12]. These developments emphasize the importance of structured multimodal learning for improving robustness in complex environments. Moreover, this fusion strategy can, in principle, be extended to alternative color models such as HSV, YIQ, or CMY, in which a selected channel may be repurposed to encode depth information. In this work, the CMYK color space is adopted, as its K (black) channel explicitly represents global intensity, making it particularly suitable for embedding normalized depth information in a semantically coherent manner [13].
Intermediate fusion: This type of fusion occurs at the feature level. The most widely used structure is the Siamese network [11,12,13,14,15]. In this structure, features are independently extracted from the RGB and depth images using parallel DNNs, then concatenated and fed into a single classifier. Although this type of classification network has a relatively large number of parameters, parallel processing of the two sources can improve its computational performance.
Late fusion: In this approach, the RGB and depth images are first classified independently using separate DNNs. The outputs (decisions) of the classifiers are then combined to obtain more reliable decisions. Various late decision fusion methods exist in the literature [16,17]; they include majority voting (for deterministic decisions) and Bayesian fusion (for probabilistic decisions).
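As a minimal illustration of these two late-fusion rules (the function names are ours, not taken from [16,17]), majority voting combines hard class labels, while Bayesian fusion, shown here in its simplest form, multiplies independent per-class posteriors and renormalizes:

```python
from collections import Counter

def majority_vote(decisions):
    """Combine deterministic class decisions from several
    classifiers by majority vote."""
    return Counter(decisions).most_common(1)[0][0]

def bayes_fuse(p_rgb, p_depth):
    """Combine per-class posteriors from two classifiers assumed
    independent: elementwise product, then renormalization."""
    prod = [a * b for a, b in zip(p_rgb, p_depth)]
    total = sum(prod)
    return [p / total for p in prod]
```

For example, three classifiers voting [2, 2, 3] yield class 2, and fusing the posteriors [0.6, 0.4] and [0.5, 0.5] yields a distribution that still favors the first class while summing to one.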
Compared to the existing types of image fusion, the color space modification approach has several advantages. It is simple, can use existing RGB-based DNN architectures without modification, and benefits from the transfer learning property of DNNs. It can also be applied to different color spaces, such as RGB, HSV, and CMY.
Although RGB–depth information fusion through color-space modification in DNN frameworks has been used in several computer vision applications, only a few researchers have applied this strategy to mobile robot obstacle avoidance (decision making and control).
The proposed approaches use the fused information for semantic segmentation and path generation [18,19,20]. Tai et al. [21] proposed a model-free obstacle avoidance algorithm that takes raw depth images as input and generates control commands as output. This algorithm combines obstacle detection and control command generation into a single convolutional neural network (CNN or ConvNet). Our proposed obstacle avoidance method extends Tai's work: instead of using only depth images as input, we use a fused image representation derived from RGB images and depth information through a CMYK–CMYD transformation.
In this article, we propose a CNN-based classification method for obstacle avoidance using a new early-stage image fusion technique. This approach turned out to be more efficient than those using RGB and depth images separately. To fuse the two types of images, we chose the color space modification early fusion approach because of the advantages mentioned above. First, the RGB images are converted into the CMYK color space. Then, the K (black) channel is replaced with a normalized version of the depth image to obtain a CMYD representation. Finally, the CMYD images are used as input to the CNN. The CNN architecture and hyperparameters were selected to achieve high classification performance.
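The CMYD construction can be sketched per pixel as follows. The conversion uses the standard RGB-to-CMYK relations; the function names and the depth normalization constant are illustrative, not taken from our implementation:

```python
def rgb_to_cmyk(r, g, b):
    """Standard per-pixel RGB -> CMYK conversion (inputs in 0..255,
    outputs in 0..1)."""
    r, g, b = r / 255.0, g / 255.0, b / 255.0
    k = 1.0 - max(r, g, b)
    if k >= 1.0:                       # pure black: chroma undefined
        return 0.0, 0.0, 0.0, 1.0
    c = (1.0 - r - k) / (1.0 - k)
    m = (1.0 - g - k) / (1.0 - k)
    y = (1.0 - b - k) / (1.0 - k)
    return c, m, y, k

def rgb_to_cmyd(r, g, b, depth, depth_max):
    """CMYD fusion: keep the C, M, Y chroma channels and replace
    the K channel with the normalized depth value."""
    c, m, y, _ = rgb_to_cmyk(r, g, b)
    return c, m, y, depth / depth_max
```

Applying `rgb_to_cmyd` to every pixel of a registered RGB–depth pair yields the four-channel CMYD image fed to the CNN; chromatic information survives in C, M, Y, while the fourth channel now carries geometry instead of intensity.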
The classifier was trained using data acquired locally by image sensors and an inertial measurement unit (IMU) attached to a TurtleBot mobile robot [22]. The data was collected in an indoor environment. The synchronized RGB and depth images were obtained from Kinect cameras (Microsoft, Redmond, WA, USA) and the orientations (angles) from the IMU. The developed CNN was then used to classify the images into one of five classes characterized by the IMU angle information.
In addition to improving classification performance, this work provides a clear analysis of where the proposed novelty lies, both at the representation and architectural levels. First, a novel RGB–depth fusion strategy is introduced through a color-space-driven early fusion mechanism, in which normalized depth information is embedded into the intensity-related K channel of the CMYK color space, resulting in the proposed CMYD representation. Unlike conventional RGB–D stacking or feature-level fusion approaches, this design enables a semantically consistent integration of chromatic appearance and geometric structure at the input level. Second, this study demonstrates that architectural refinement plays a crucial role in effectively exploiting depth information. While depth-only inputs typically underperform compared to RGB images when processed by a conventional CNN, the proposed refined CNN architecture, which integrates multi-scale convolutional analysis and attention-based feature refinement, significantly reduces this performance gap. As a result, depth-based classification becomes comparable to RGB-based performance, and the benefits of the CMYD fusion are further amplified. This dual contribution clarifies that the observed performance gains stem not only from improved data fusion, but also from a network architecture specifically designed to enhance the representation and utilization of geometric information for mobile robot obstacle avoidance.
The remainder of this paper is organized as follows.
Section 2 provides a brief overview of convolutional neural networks and their main building blocks.
Section 3 describes the proposed CNN-based classification framework, including the refined network architecture.
Section 4 details the proposed RGB–Depth fusion strategy based on the CMYK and CMYD representations, presents the experimental setup, and discusses the obtained results. Finally,
Section 5 concludes the paper and outlines potential directions for future work.
3. Development of the CNN Architecture
The convolutional neural network presented in this study is derived from a conventional two-dimensional CNN architecture through targeted structural modifications, rather than being proposed as an entirely new model. The original configuration follows a standard vision-based processing pipeline, where two-dimensional images are progressively transformed through convolutional and pooling stages before being mapped to navigation-related outputs via fully connected layers. This conventional setup is adopted as a reference point to allow the effects of the proposed architectural changes to be examined in an explicit and controlled manner.

Within the reference model, the expressive capacity of the network is primarily expanded by increasing the number of convolution–pooling layers. Each convolutional layer operates at a fixed spatial scale and performs similar feature extraction operations. As a result, deeper configurations tend to amplify already learned visual patterns instead of introducing qualitatively different representations. For perception-driven navigation tasks, such behavior may be insufficient, as mobile robots must simultaneously reason about local obstacle details and the global structure of the surrounding environment.

The proposed architecture departs from this depth-centric design philosophy by redefining the role of individual convolutional layers. Rather than increasing network depth, each layer is engineered to capture visual information at multiple spatial resolutions within a single processing step. This is realized through parallel convolutional operations employing kernels of varying sizes. Smaller receptive fields emphasize fine-grained visual cues such as object boundaries, while larger receptive fields encode coarse spatial information related to traversable regions and scene layout. The parallel responses are subsequently combined, enabling the network to integrate localized and contextual information at an early stage of processing.
Through this reformulation, network depth is no longer interpreted as a mere count of sequential layers. Instead, depth is associated with the diversity and richness of representations formed within each layer. This interpretation is particularly suitable for vision-based mobile robot navigation, where reliable decision-making depends on the joint availability of precise local perception and broader environmental awareness. By embedding multi-scale reasoning directly into each convolutional layer, the architecture achieves richer representations without relying on excessive layer stacking.
To further enhance the usefulness of the extracted features, an attention-based refinement mechanism is applied after the multi-scale fusion stage. This module dynamically adjusts the relative contribution of different feature channels, promoting responses that are critical for obstacle avoidance and navigation while attenuating less relevant activations. Such adaptive weighting improves robustness when operating in visually complex or cluttered environments.
The spatial down-sampling strategy is also revised accordingly. In contrast to conventional designs where pooling follows immediately after convolution, pooling in the proposed architecture is deferred until after multi-scale extraction and attention-based refinement. This delay ensures that informative spatial patterns related to obstacle geometry and free-space configuration are adequately processed before any reduction in spatial resolution occurs.
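The processing order described above, parallel multi-scale convolution, channel-wise attention, and only then pooling, can be illustrated with a deliberately simplified single-channel sketch. The attention rule here (weighting each feature map by its normalized total activation) is a stand-in for the learned attention module, not the mechanism used in the actual network, and all names are illustrative:

```python
def conv2d(img, kernel):
    """Naive 'same' 2-D convolution (zero padding) on a
    single-channel image given as a list of rows."""
    h, w = len(img), len(img[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            s = 0.0
            for u in range(k):
                for v in range(k):
                    y, x = i + u - pad, j + v - pad
                    if 0 <= y < h and 0 <= x < w:
                        s += img[y][x] * kernel[u][v]
            out[i][j] = s
    return out

def multi_scale_block(img, kernels):
    """Parallel convolutions at several kernel sizes, a simple
    channel-weighting step, then 2x2 max pooling; pooling is
    deferred until after the weighting, as in the refined design."""
    # 1) multi-scale extraction: one feature map per kernel size
    maps = [conv2d(img, k) for k in kernels]
    # 2) simplified attention: weight each map by its share of
    #    the total activation (a learned module in the real network)
    totals = [sum(map(sum, m)) for m in maps]
    norm = sum(totals) or 1.0
    refined = [[[wt / norm * v for v in row] for row in m]
               for m, wt in zip(maps, totals)]
    # 3) delayed 2x2 max pooling on each refined map
    pooled = []
    for m in refined:
        h, w = len(m) // 2, len(m[0]) // 2
        pooled.append([[max(m[2*i][2*j], m[2*i][2*j+1],
                            m[2*i+1][2*j], m[2*i+1][2*j+1])
                        for j in range(w)] for i in range(h)])
    return pooled
```

The essential point is structural: the spatial resolution of the feature maps is reduced only after the multi-scale responses have been combined and reweighted, so no fine-grained obstacle geometry is discarded before the attention step has acted on it.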
In summary, the proposed CNN retains the same two-dimensional processing framework, input structure, and learning objective as the baseline model while substantially modifying the internal information flow within each convolutional layer. The architectural emphasis is shifted from increasing the number of layers to enhancing intra-layer representational capability, resulting in a perception model that is better aligned with the functional requirements of mobile robot navigation. Finally, the architecture is intentionally designed to remain independent of any specific visual modality at this stage. This allows the evaluation to focus on the inherent learning capacity of the network itself, independent of the characteristics of the input representation. Consequently, the same architecture can be consistently assessed using RGB images, depth data, or fused inputs, enabling an unbiased comparison of sensory representations in subsequent experiments [24]. The architectural differences between the baseline and refined CNN models are summarized in Table 1.
Training Strategy and Hyperparameter Selection
Once the architectural design is established, an appropriate training strategy is defined to ensure stable optimization and reliable generalization of the refined CNN architecture.
The objective of the training phase is to identify a suitable set of hyperparameters, noting that no universal rule or closed-form solution exists for optimal hyperparameter selection, as performance strongly depends on the task characteristics and data distribution. Accordingly, a series of controlled experiments was conducted to evaluate the influence of different training configurations on classification performance. Network optimization was performed using stochastic gradient descent with a momentum coefficient of 0.9 and a weight decay of 0.0005, which are commonly employed to accelerate convergence while mitigating overfitting. Key hyperparameters, including the learning rate, mini-batch size, and number of training epochs, were systematically varied to assess their effects on convergence behavior and testing accuracy. The experimental results indicated that a learning rate of 0.001 provides a stable optimization trajectory, enabling efficient learning without oscillatory behavior or premature convergence. With respect to mini-batch size, moderate batch sizes were found to be more effective. In particular, a mini-batch size of 32 yielded the most consistent and accurate performance across validation runs. From an optimization standpoint, such batch sizes offer a favorable balance between gradient stability and stochasticity, which often contributes to improved generalization compared to larger batch sizes that may lead to overly smooth updates.
Regarding the number of training epochs, the network was trained until convergence, defined by stabilization of the training loss and the absence of further improvement in validation accuracy. This strategy avoids underfitting due to insufficient training as well as overfitting caused by excessive iterations. Overall, the adopted training strategy and hyperparameter configuration were selected based on empirical evaluation rather than heuristic assumptions, ensuring that the reported performance reflects the intrinsic capabilities of the refined CNN architecture.
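The adopted hyperparameters and stopping rule can be summarized in a short sketch. The configuration values are those reported above; the patience-based convergence test is one common way to operationalize "no further improvement in validation accuracy" and is illustrative rather than the exact criterion used:

```python
# Hyperparameters reported in the text (SGD with momentum)
CONFIG = {"learning_rate": 0.001, "momentum": 0.9,
          "weight_decay": 0.0005, "batch_size": 32}

def stopping_epoch(val_accuracies, patience=5):
    """Given per-epoch validation accuracies, return the epoch at
    which training stops: the first epoch after `patience` epochs
    with no new best accuracy, or the last epoch otherwise."""
    best, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_accuracies) - 1
```

This kind of criterion halts training once the validation curve plateaus, which is what guards against both underfitting (stopping too early) and overfitting (iterating long past convergence).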
5. Conclusions
This paper presented a vision-based classification framework for mobile robot obstacle avoidance that jointly exploits appearance (RGB) and geometric (depth) information. Unlike conventional RGB–D fusion approaches that rely on simple channel stacking or late decision fusion, we introduced a novel early-stage fusion strategy based on color-space modification. Specifically, RGB images are first converted into the CMYK color space, after which the K (black) channel is replaced by a normalized depth map, resulting in a four-channel CMYD representation. Embedding depth information into an intensity-related channel enables semantically consistent fusion while preserving chromatic cues in the remaining channels.
Extensive experiments were conducted using a locally collected dataset acquired with a TurtleBot platform equipped with a Kinect sensor and an IMU, where navigation commands were categorized into five motion classes. The results demonstrate that the proposed CMYD fusion consistently outperforms RGB-only, depth-only, and CMYK-based inputs. Using the baseline CNN architecture, the CMYD representation achieved an overall classification accuracy of 93.3%, compared to 92.9% for RGB and 86.5% for depth-only inputs. When evaluated with the refined CNN architecture, performance improved across all input modalities, with the proposed CMYD representation reaching a maximum accuracy of 96.2%. This confirms that the performance gain originates from the fusion strategy itself and is further enhanced by architectural refinement.
The originality of this work lies in two complementary contributions. First, a novel RGB–depth fusion mechanism is introduced by embedding normalized depth information into the intensity-related K channel of the CMYK color space, yielding the CMYD representation, which provides a more principled integration of appearance and geometric cues than conventional RGB–D stacking. Second, a refined CNN architecture incorporating multi-scale convolutional analysis and attention-based feature refinement is proposed, enabling more effective extraction of navigation-relevant features and reducing confusion between neighboring motion classes. The observed improvements are therefore not the result of parameter tuning alone, but stem from a task-oriented fusion design combined with enhanced intra-layer feature learning. Overall, the experimental findings demonstrate that representation-level fusion (CMYD) and architecture-level optimization (refined CNN) play complementary roles in achieving robust and accurate obstacle avoidance. The proposed framework establishes an effective and scalable solution for RGB–D perception in mobile robotics. Future work will investigate extending the CMYD fusion strategy to other color spaces, evaluating performance in more complex and dynamic environments, and integrating more advanced attention mechanisms to further improve adaptability and generalization.