1. Introduction
The visual system is the primary source of information for humans, with studies showing that over 80% of sensory inputs are directly or indirectly based on vision [1,2]. It is a highly efficient mechanism capable of processing multi-channel inputs, allowing humans to understand complex environments and serving as a fundamental basis for behavior [3,4,5]. Within this system, photoreceptors, specifically rod cells and cone cells, capture light and transduce it into neural signals [6]. Cone cells, in particular, play a crucial role in color vision by processing signals from three channels divided according to wavelength, corresponding to red, green, and blue (RGB) inputs [7,8]. The principles of multi-channel visual processing have inspired modern artificial neural network models [9,10,11]. Convolutional neural networks (CNNs) and vision transformers (ViTs) have been developed to handle preprocessed motion direction detection tasks in certain situations. CNNs utilize spatial hierarchies to extract motion features, while ViTs employ self-attention mechanisms to capture long-range dependencies, leading to superior performance in image classification and recognition tasks [12,13,14]. However, despite these advancements, many artificial models struggle to replicate the full capabilities of biological vision. Models based on artificial neural networks are designed to process RGB images, yet they may still fail to detect pixel-level motion or to generalize to diverse image transformations [15,16]. This limitation underscores the need for more adaptive solutions inspired by biological visual systems. Many functions of the visual system are shaped by learning and environmental factors, highlighting the potential of bio-inspired systems to replicate these capabilities [17,18].
The retina is generally regarded as the first processing stage of the visual system [2,7]. It integrates and modulates visual input through multiple neural layers [19,20], which suggests that bio-inspired models and neural networks based on retinal mechanisms have the potential to surpass traditional neural network designs. Direction selectivity is an essential function of the mammalian retina, enabling motion detection and motion direction processing [18,21]. This capability underlies behaviors such as navigation and predator avoidance. Hubel and Wiesel's pioneering work in the 1960s revealed that neurons in the primary visual cortex (V1) are selectively responsive to specific orientations and directions of motion [22]. Their model suggests that the alignment of retinal structures contributes to the selectivity observed in the primary cortex. In further research, they found that simple cells and complex cells located in V1 exhibit direction selectivity when groups of light spots or thin lines are detected, indicating that such selectivity is widespread throughout the early visual system [23].
Horizontal cells (HCs), which are essential for early-stage signal processing, are also found among the multiple neural layers of the retina [24,25,26]. HCs are known to mediate lateral inhibition, enhancing contrast and refining visual perception [27,28]. Recent studies suggest that horizontal cells may also play a critical role in determining direction selectivity by influencing the ganglion cells within the retina [29]. Specifically, research has shown that injecting a current into horizontal cells affects the activity of the retinal ganglion cells involved in detecting both fast and slow motion, as well as direction selectivity [30,31]. This result indicates that HCs may contribute significantly to the modulation of direction-selective responses in the retina. Other research on retinal cells has shown that direction-selective retinal ganglion cells integrate global motion, generating sparse yet consistent responses aligned with specific motion directions, especially in large cell populations. Studies have demonstrated that local motion decoders rely on correlated neural activity to achieve spatial resolution, whereas global motion decoders assume independent inputs to facilitate continuous motion direction readout [32,33]. This highlights the critical role of the ganglion cells in global motion perception. These processed signals are subsequently transmitted to the lateral geniculate nucleus (LGN) and deeper cortical areas for further interpretation, forming the foundation for biological decision making [19,34]. Compared to traditional neural-network-based methods, bio-inspired models offer faster processing speeds and higher accuracy, particularly in tasks closely related to human visual perception [11,35]. Traditional neural networks, which rely on perceptron-based architectures, often lose critical associations between input signals [36,37]. While transformer-based attention mechanisms provide a potential solution, they require extensive computational resources and significant processing time [14,38].
Dendritic structures are fundamental to neuronal processing, particularly in the visual system. As a key function of the brain—an organ composed of over $10^{11}$ neurons—the visual system relies on the intricate interactions between these neurons, which are mediated by their dendritic structures. Remarkably, this system facilitates more than $10^{14}$ interactions, highlighting the critical role of dendritic organization in neural computation [39,40]. Unlike perceptron-based models, which disregard the complex interactions of synapses and dendrites, dendritic neuron models integrate excitatory and inhibitory inputs across spatial locations, enabling nonlinear signal processing within the model [41,42]. This enhances the ability to capture local features and to recover the importance of individual inputs in classification tasks. Dendritic neuron models provide an effective framework for processing input information by mimicking the intricate interactions of local signals, which improves not only computational and learning efficiency but also generalization capability and robustness [42,43].
Although dendritic-neuron-based learning models show promise, some dendritic neuron models remain limited to grayscale image processing and lack the ability to handle multi-channel inputs such as RGB images [44]. Consequently, they fail to capture the full complexity of natural vision. While certain models have explored multi-channel processing, their structures remain unclear because the deep and complex processing mechanisms beyond the retina, in the LGN, are not fully understood [19,45]. This lack of clarity makes conclusive model development challenging and increases architectural complexity. As a result, multi-channel processing based on the LGN may lack sufficient biological interpretability [46]. These limitations suggest that such models may not fully replicate the biological efficiency and adaptability required for complex visual tasks.
Based on these principles, our previous study introduced an improved artificial visual system (AVS) that uses dendritic-neuron-based structures to replicate how the retina processes information. Unlike earlier single-channel systems, that model detects motion direction in RGB images rather than grayscale ones. By leveraging the direction selectivity of horizontal cells, it improves motion detection, bringing the approach more in line with how biological vision works [47]. This allows the system to process multi-channel visual signals effectively, combining color information and motion direction in a way that closely resembles natural vision. As a result, the model brings artificial vision closer to the complexity and functionality of biological vision.
Despite the advancements in visual motion detection models, an essential scientific challenge remains: replicating the direction selectivity and noise robustness of the biological visual system in a computational model, especially in pixel-level motion scenarios. While deep learning models perform well at high-level motion representation, they often struggle with frame-by-frame processing and incur high computational costs. In contrast, our previous model, the AVS, exhibits higher computational efficiency but relies on unclear biological mechanisms, such as the LGN with its hypothesized multiplicative operations. To address these limitations, we propose a learnable horizontal-cell-based dendritic neuron model (HCdM), which builds on the established direction selectivity of HCs, a mechanism with direct biological evidence, rather than on the LGN, which lies deeper in the visual system. The proposed model improves not only biological plausibility but also computational efficiency compared to other models. The HCdM thus strikes a balance between reduced computational cost and improved biological plausibility in motion direction tasks.
The novelty of this work lies mainly in its bio-inspired dendritic structural connection to HCs, which replicates localized visual computation and enables retinal direction selectivity, features that most computational models overlook even though they play a crucial role in distinguishing object edges from background noise. This structure enables the HCdM not only to process RGB motion direction with high precision but also to maintain robustness under noisy conditions and dataset shifts. Our goal is to explore whether a hierarchical dendritic framework with synaptic learning can simulate the learning process of direction selectivity and outperform conventional deep models under constrained scenarios.
After simulations of our model, the experimental results indicate that it accurately identifies nearly all motion directions in pixel-level object movement across eight directions in consecutive frames. Because conventional CNNs lack multi-channel and multi-modal processing capabilities, we extended them through dataset preprocessing to include them in our comparisons. For models with multi-modal processing abilities, such as ViTs, we employed both a spatiotemporal ViT (stViT) and a two-stream ViT (2sViT). The former expands along the temporal sequence for motion direction detection, while the latter treats the dataset as multi-modal and processes it accordingly [48,49]. Self-supervised temporal learning models such as DuTriNet, which incorporates a flow-intensity-based non-parametric frame sampling algorithm within a triplet Siamese architecture, may be considered candidate solutions for motion direction detection because of their strong ability to identify significant motion shifts, such as body pose transitions in action recognition tasks [50]. However, these models lack sensitivity to the subtle, pixel-level displacements present in our dataset, which is the most important difference between our test dataset and the datasets those models target. These methods enable the models to perform recognition across more than a single image. Experimental results demonstrate that our model exhibits high accuracy and computational efficiency. Simulations further confirm its precision even when trained on relatively small training datasets, an advantage not observed in the other models. This study simulates the learning process of direction selectivity in the brain, reinforcing its underlying biological theory. Additionally, our simulations suggest that, in RGB images, horizontal cells can theoretically achieve direction selectivity through their relative influence on other cells. This supports the role of computational simulations in exploring deeper principles in biological theories.
Our study draws inspiration from neuroscience, particularly the work of Hubel and Wiesel, to explore how dendritic structures contribute to direction selectivity in visual processing. By integrating this principle into our model, we not only improve its ability to learn direction selectivity but also enhance its biological plausibility. Theoretically, the model's dendritic structure amplifies imperceptible motion cues between frames, enabling the detection of extremely fine-grained changes. This approach strengthens the alignment of our model with biological mechanisms, making it a more accurate representation of how biological vision systems operate. Moreover, our approach has wide-ranging potential applications in motion direction tasks, such as autonomous driving, robotic vision, and medical imaging analysis. Practically, this makes our model particularly suitable for pixel-level motion tasks such as precision manufacturing or cell movement tracking, where detecting subtle displacements is crucial. These examples demonstrate the practical value of our findings and underscore the broader impact of bio-mimetic computational models in solving real-world problems.
2. Methods
To simulate the motion direction detection process in the biological visual system, we designed our model following the functional structure of the retina. We began with a layerwise model that replicates the key components of the retina. First, we simulated the outer nuclear layer, where photoreceptor cells receive light input and separate it into three channels. Next, we modeled the inner nuclear layer, focusing on the HCs and their lateral inhibitory functions, which modulate bipolar cell outputs to enable contrast enhancement and motion direction selectivity. We employed dendritic neuron structures to integrate the feature information for motion direction detection in the ganglion cell layer, to which these signals are then passed. Finally, we implemented a backpropagation-based training mechanism that adjusts the connection positions of the synapses according to the prediction errors. This structure enables the model not only to reflect biological processing principles but also to learn motion direction efficiently in RGB images.
2.1. Retinal RGB Image Processing
In the retina, three types of cone cells are considered to process color information, each sensitive to its own wavelength range of light. This biological mechanism, based on the detection of three distinct wavelength bands, closely aligns with the RGB color model used in digital imaging, as the RGB color channels also represent different wavelength ranges of light [51]. As shown in Figure 1, cone cells enable color perception, while rod cells support low-light vision by detecting the presence or absence of light [52]. In the figure, we assume that the input image consists of $M \times N$ pixels and activates $M \times N$ photoreceptor cells; a single pixel $P_{(x,y)}$ (black square) in the image is composed of light at three wavelengths that is received by the cone cells of the corresponding channels: $R$, $G$, and $B$. The similarity between biological color perception and the RGB format provides a strong basis for using three-channel image datasets in computational artificial visual systems [53].
Furthermore, the retina processes color information in parallel, with different color channels handled independently before being integrated at higher levels of processing [54]. This parallel processing suggests that an artificial visual system should likewise adopt a multi-channel approach to better simulate biological vision [55]. By maintaining a separate processing pathway for each channel, our model is designed to capture motion and spatial information as biologically meaningful features.
2.2. On–Off Response Mechanism Based on Horizontal Cells
The detection of motion direction begins when the retina receives light stimuli [31]. Photoreceptor cells convert these stimuli into electrical signals, which are processed by the various neural layers of the retina, including the horizontal cells [28]. Horizontal cells play a crucial role in lateral inhibition during this process; the HCs adjust the responses of bipolar cells and contribute to direction selectivity [26]. As shown in Figure 2a, HCs are located in the retina between the photoreceptors and bipolar cells, modulating visual signals before they reach the ganglion cells [24].
In the case of RGB image processing, our model extends the horizontal cell mechanism to handle the complexity introduced by multi-channel inputs [27]. Each color channel is processed separately, mimicking how biological vision treats different wavelengths of light [25]. For each channel, photoreceptors transform the incoming light received by the corresponding cone cells into signals representing pixel intensity [28]. These signals are then processed by HCs to detect local intensity variations, thereby enabling motion detection through recognition of object boundaries [30].
We define this mechanism as the horizontal On–Off response, in which HCs analyze the differences between surrounding pixel intensities and a central pixel within a local receptive field [27]. Since HCs cannot output to the ganglion cells directly, we designed a specific neuron that represents the cooperation of HCs with other cells, such as amacrine cells or bipolar cells, to emphasize the importance of HCs in direction selectivity [29]. This comparison is made against a predefined threshold $\theta$ to activate the horizontal cells. The $\theta$ is the minimum resolvable grayscale difference threshold, which is determined based on the dynamic range of human visual perception [56]. A schematic representation of how a local receptive field in a single channel contributes to direction selectivity through horizontal cells is shown in Figure 2b: the target central pixel in image $t$ is highlighted in red, and the surrounding pixels in image $t+1$ are highlighted in blue. All the surrounding pixels are compared with the central pixel by the horizontal cells and influence the final output. For a receptive field, $X_{(x,y)}$ is the input matrix located at $(x,y)$ for the dendritic ganglion cell layer. The output of this kind of specific neuron is
$$X_{(x,y)}(u,v) = \begin{cases} 1, & \left| P_{t+1}(x+u,\, y+v) - P_{t}(x,y) \right| < \theta \\ 0, & \text{otherwise,} \end{cases} \tag{1}$$
where $P_{t}(x,y)$ denotes the pixel intensity at $(x,y)$ in frame $t$, $(u,v)$ indexes the surrounding positions, and $\theta$ is the threshold of the resolvable grayscale value, set to 3 as the lowest difference that can be recognized.
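As a concrete illustration, the following NumPy sketch implements the comparison in Equation (1) for a single color channel; the function and array names are illustrative assumptions rather than the exact implementation used in our experiments:

```python
import numpy as np

THETA = 3  # minimum resolvable grayscale difference (Section 2.2)

def hc_on_off(frame_t: np.ndarray, frame_t1: np.ndarray, x: int, y: int) -> np.ndarray:
    """Horizontal On-Off response for one channel: compare the central
    pixel at (x, y) in frame t with its 3x3 neighborhood in frame t+1.
    Returns a 3x3 binary matrix with 1 where the neighbor matches the
    central pixel within THETA (an On response) and 0 otherwise."""
    center = int(frame_t[x, y])
    neighborhood = frame_t1[x - 1:x + 2, y - 1:y + 2].astype(np.int32)
    return (np.abs(neighborhood - center) < THETA).astype(np.float32)
```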
2.3. Dendritic Ganglion Cells in Direction Selectivity
Dendritic neuron models have proven highly valuable in AVS research, including our own previous studies. It has been demonstrated that dendritic-neuron-based visual processing models simulate biological neural structures and can effectively process signals both spatially and temporally. This relies on the dendritic structure, a fundamental component of neurons that enables nonlinear computations, unlike perceptron-based artificial neural networks, which often overlook these complex interactions [36,37,39]. Compared to multi-layer perceptrons (MLPs), which process input signals through fully connected linear layers followed by pointwise nonlinearity, our dendritic neuron model introduces localized bio-inspired nonlinear computation. This allows the synaptic and branch structure to modulate the input signals, exploiting the biological dendritic structure that MLPs ignore. Furthermore, unlike MLPs, where all neurons within a layer share the same activation function and global parameters, dendritic neuron models assign unique learnable parameters to each propagation pathway. The parameter set in each branch allows more localized representations in dendritic neurons to capture features [57]. Ganglion cells play a crucial role in the visual system as the final stage of the retinal network before relaying visual information to the brain through the optic nerve [58].
In this simulation, the dendritic ganglion cells receive signals from the HC-equivalent neurons in the retina, as the outputs of bipolar cells are influenced by HC inhibition [24]. Bipolar cells adjust their responses based on input from photoreceptors and horizontal cells, which regulate signal transmission through HC inhibition. When an HC inhibits a bipolar cell, the bipolar cell's output is suppressed; thus, the bipolar cell's response can be regarded as reflecting the inhibitory influence of HCs. The visual system then determines the brain's final judgment on motion direction by utilizing the output from these neurons. These neurons are not themselves inherently direction-selective but contribute to motion processing through their connectivity with direction-selective retinal ganglion cells [32].
Our dendritic neuron model, in which each ganglion cell consists of four layers, is illustrated in Figure 3. Figure 3a shows the biological structure of a neuron with its dendritic structure: the red strings are the synapses and branches, the red circles are the connection nodes, and the blue circle is the cell body, with the membrane colored black. Figure 3b presents the dendritic neuron model structure, which includes the input, synapse, branch, membrane, and soma layers. The red triangles represent the nodes. The white circles are connected to the inputs, and since the connection positions are not known in advance, we connect every input to every branch for each soma.
The dendritic model in Figure 3b has $I$ synapses (or, equivalently, $I$ inputs $x_1, \dots, x_I$), $J$ branches, and $M$ outputs, and it illustrates the processing at the $i$th synapse and $j$th branch for a specific output $m$. The sigmoid function is used as the activation function of the dendritic neuron model. The dendrite conveys the signal strength from the synapse to the branch, which ultimately connects to the soma. The sigmoid function enhances differences near the threshold, which amplifies the excitatory or inhibitory signal while maintaining the gradient, so the whole system is able to learn from the labeled dataset. The process is as follows:
1. The synapse layer receives input signals from the HC-equivalent cells and processes them according to the following equation:
$$Y_{ijm} = \frac{1}{1 + e^{-d_{ijm}\left(w_{ijm} x_{i} - q_{ijm}\right)}}, \tag{2}$$
where each synapse outputs the corresponding signal $Y_{ijm}$. $d_{ijm}$ is the distance parameter, which represents the length of the synapse, i.e., the distance from the $i$th input to the $j$th branch of output $m$. The $w_{ijm}$ and $q_{ijm}$ values are the learnable parameters at a specific channel that influence the output of the synapse and, ultimately, of the whole system.
2. In the branch layer, each branch $j$ processes the synaptic outputs into a branch-level output $Z_{jm}$ using a nonlinear algorithm, which captures the interaction of incoming signals through the following equation:
$$Z_{jm} = \prod_{i=1}^{I} Y_{ijm}. \tag{3}$$
This nonlinear algorithm enables the system to adjust its learning outcomes based on the connection state. The sigmoid function ensures that the output remains close to 0 when the corresponding synaptic connection has minimal influence from the corresponding branch on the decision-making process in the soma. Conversely, it maintains a constant value of 1 when the synaptic inputs have little effect on the branch. This behavior depends on the specific relationship between the parameters $w_{ijm}$, $q_{ijm}$, and 0, which is given by the learning process.
3. The membrane layer plays a crucial role in integrating the outputs from the branch layer. Each branch contributes to the membrane through a weighted sum, where the weight $u_{jm}$ determines the influence of the $j$th branch. This process is mathematically represented as
$$V_{m} = \sum_{j=1}^{J} u_{jm} Z_{jm}, \tag{4}$$
where $Z_{jm}$ denotes the output of the $j$th branch, and $u_{jm}$ acts as the weight that scales the contribution of each branch to the overall membrane layer output. This weighted integration ensures that the membrane layer effectively combines information from multiple branches, forming a unified input for the next stage of processing.
4. The soma layer, or cell body layer, serves as the final processing stage, where the membrane output is transformed into the dendritic neuron's output through the activation function. This layer uses the same activation function as the other layers, introducing a threshold $\theta_{s}$ and a slope parameter $k$. The output $O_{m}$ is computed as
$$O_{m} = \frac{1}{1 + e^{-k\left(V_{m} - \theta_{s}\right)}}. \tag{5}$$
In this equation, $\theta_{s}$ represents the threshold that the membrane potential must exceed to activate the neuron, while $k$ controls the steepness of the sigmoid curve, determining the sharpness of the activation. Together, these parameters allow the soma layer to generate a smooth, continuous output that reflects the neuron's response to the integrated input.
Particularly in the branch layer, our model implements nonlinear computations that distinguish it from CNNs. This mechanism allows each input to contribute differently based on learned dependencies, making it more similar to ViT architectures, which capture relationships between different spatial regions.
Furthermore, the use of a sigmoid activation function enhances the stability of the learning process. The model can dynamically adjust synaptic connections during training, enabling it to learn inhibitory, excitatory, and constant response states. This adaptability is particularly important for direction-selective processing, where the model needs to determine relevant features from dynamic visual inputs.
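To summarize Equations (2)–(5), the sketch below implements one dendritic output unit in PyTorch under the reconstruction given above; the tensor shapes, the fixed distance parameter, and the default values of $k$ and $\theta_s$ are our illustrative assumptions, not the exact experimental configuration:

```python
import torch

class DendriticNeuron(torch.nn.Module):
    """One dendritic output unit with I synapses and J branches (Eqs. (2)-(5))."""

    def __init__(self, n_inputs: int, n_branches: int,
                 k: float = 5.0, theta_s: float = 0.5):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(n_branches, n_inputs))  # learnable weights
        self.q = torch.nn.Parameter(torch.randn(n_branches, n_inputs))  # learnable biases
        self.u = torch.nn.Parameter(torch.ones(n_branches))             # branch weights
        self.d = 1.0                  # distance parameter, fixed here for simplicity
        self.k, self.theta_s = k, theta_s

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Synapse layer, Eq. (2): sigmoid of d * (w * x - q) per synapse and branch.
        y = torch.sigmoid(self.d * (self.w * x.unsqueeze(-2) - self.q))  # (..., J, I)
        # Branch layer, Eq. (3): multiplicative interaction across synapses.
        z = y.prod(dim=-1)                                               # (..., J)
        # Membrane layer, Eq. (4): weighted sum of branch outputs.
        v = (self.u * z).sum(dim=-1)                                     # (...)
        # Soma layer, Eq. (5): thresholded sigmoid readout.
        return torch.sigmoid(self.k * (v - self.theta_s))
```

For the eight motion directions, eight such units can run in parallel, one per direction ($M = 8$).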
2.4. Application of the Dendritic Model to Horizontal Cell Direction Selectivity
Direction selectivity, as discovered by Hubel and Wiesel, is a fundamental property of the visual system. Research has also indicated that horizontal cells exhibit direction selectivity, meaning they can influence motion perception at early stages of processing.
In our model, the dendritic structure in the ganglion cells processes local receptive fields by dynamically integrating inputs from multiple branches. This allows the model to capture the spatial and directional information of objects in the images more effectively, mimicking the behavior of the retina. By focusing on local receptive fields, our model keeps motion direction detection biologically plausible while drawing on ideas from both artificial neural networks and bio-inspired vision systems.
In this simulation, the inputs, denoted as $X$, are processed by the dendritic-neuron-based ganglion cell layer. We define the input matrix $X_{(x,y)}$ to represent a single receptive field for motion direction detection (in this scenario, a $3 \times 3$ field). This approach enables the model to handle dynamic visual inputs while maintaining the adaptability needed for direction-selective processing.
The synaptic output matrix $Y$ is defined with (2) by
$$Y = \frac{1}{1 + e^{-D \odot \left(W \odot X - Q\right)}}, \tag{6}$$
where $W$ and $Q$ are the weight and bias matrices, defined as $W = \left[w_{ijm}\right]_{I \times J}$ and $Q = \left[q_{ijm}\right]_{I \times J}$; here $X$ replicates the $I$ receptive-field inputs across the $J$ branch columns, $D = \left[d_{ijm}\right]_{I \times J}$ collects the distance parameters, and the exponential is applied elementwise.
In the branch layer, the branch output matrix $Z$ is defined by
$$Z = Y_{1} \odot Y_{2} \odot \cdots \odot Y_{I}, \tag{7}$$
where ⊙ denotes the Hadamard product, i.e., the elementwise multiplication of corresponding elements in the matrices, and $Y_{i}$ is the $i$th row of $Y$; the result is a $1 \times J$ vector of branch outputs.
In the membrane layer, the output matrix $V$ is
$$V = Z\,U, \tag{8}$$
where $U = \left[u_{1m}, \dots, u_{Jm}\right]^{\top}$ is the additional weight of the branches.
In the cell body layer, the judgment on motion direction is defined with (5) by
$$O_{m,(x,y),c} = \frac{1}{1 + e^{-k\left(V - \theta_{s}\right)}}. \tag{9}$$
In this scenario, $O_{m,(x,y),c}$ is the computational result for the local receptive field $X_{(x,y)}$ in channel $c$, which yields a local result. For the whole image, the result is computed by the whole system as an output vector via sumpooling, the processing step that corresponds most closely to global detection and other complex behaviors in the most widely accepted theory of retinal visual systems [59]. The global motion detection following this theory is defined as
$$G_{m} = \sum_{c \in \{R,G,B\}} \sum_{(x,y)} O_{m,(x,y),c}. \tag{10}$$
This works out a vector with the probability of each motion direction. As in standard artificial neural networks, the final output is the direction with the maximum probability, $\hat{m} = \arg\max_{m} G_{m}$, over the probability distribution of motion directions.
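The sumpooling readout of Equation (10) reduces to a few tensor operations; in the sketch below, the layout of the local responses and the softmax normalization are assumptions for illustration:

```python
import torch

def detect_direction(local_out: torch.Tensor) -> int:
    """local_out: an (M, C, H, W) tensor of local responses O_{m,(x,y),c}
    for M directions, C channels, and H x W receptive-field positions."""
    scores = local_out.sum(dim=(1, 2, 3))   # Eq. (10): sumpool over channels and positions
    probs = torch.softmax(scores, dim=0)    # assumed normalization to probabilities
    return int(torch.argmax(probs))         # predicted direction m-hat
```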
2.5. Learning Algorithm
The learning process in the proposed model is designed to optimize motion direction detection accuracy while ensuring detection efficiency. The training framework consists of supervised learning, gradient-based optimization, and adaptive loss functions, with each contributing to the model’s learning process.
Specifically, supervised learning enables the model to recognize motion features by mapping input stimuli to the corresponding motion directions. Gradient-based optimization is used to fine-tune the synaptic weight matrix and bias matrix of the dendritic neuron model. Furthermore, the adaptive loss function penalizes incorrect motion direction predictions while reinforcing biologically plausible responses, thereby aligning the learning process with the principles of neural computation.
We also implemented a synaptic pruning mechanism built into the learning process: by changing the synaptic parameters, noncontributing synapses are weakened over time, mimicking biological synaptic plasticity. This ensures that the model remains efficient and adaptive to various motion conditions. By integrating these mechanisms, the proposed model provides a bio-inspired framework for motion detection that CNN-based methods cannot match in either accuracy or computational efficiency. The direction-selective mechanisms based on the retina and dendritic neuron structures facilitate a more biologically faithful simulation of motion perception.
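The pruning rule can be read directly off the connection states described after Equation (3); the sketch below classifies a learned synapse from the signs and ordering of $w$, $q$, and 0, with the boundary handling being an illustrative simplification:

```python
def synapse_state(w: float, q: float) -> str:
    """Classify a learned synapse by the relationship among w, q, and 0.
    Constant states carry no information: a constant-0 synapse nullifies
    its branch product, and a constant-1 synapse can simply be removed."""
    if w > 0 and 0 < q < w:
        return "excitatory"   # output follows the input
    if w < 0 and w < q < 0:
        return "inhibitory"   # output inverts the input
    if q < min(0.0, w):
        return "constant-1"   # argument positive for all x in [0, 1]: prunable synapse
    return "constant-0"       # argument negative for all x in [0, 1]: prunable branch
```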
Cross-entropy loss is well suited for classification tasks, as it encourages the predicted probability distribution to closely approximate the true distribution. As a result, the loss function employed in the model is the cross-entropy loss, which is defined as
$$E = -\sum_{m=1}^{M} T_{m} \log O_{m}, \tag{11}$$
where $T_{m}$ is the correct output for the $m$th direction (the labeled output), and $O_{m}$ is the output of the model, i.e., the predicted probability of that direction.
Minimizing the loss $E$ improves the model's motion direction detection performance. In order to update the weight matrix $W$ and bias matrix $Q$ using the backpropagation algorithm, their gradients with respect to the loss function $E$ must be derived. We assume that the learning rate is $\eta$. Using the chain rule, the gradient of $E$ with respect to $w_{ijm}$ can be worked out as
$$\frac{\partial E}{\partial w_{ijm}} = \frac{\partial E}{\partial O_{m}} \cdot \frac{\partial O_{m}}{\partial V_{m}} \cdot \frac{\partial V_{m}}{\partial Z_{jm}} \cdot \frac{\partial Z_{jm}}{\partial Y_{ijm}} \cdot \frac{\partial Y_{ijm}}{\partial w_{ijm}}. \tag{12}$$
Similarly, the gradient of $E$ with respect to $q_{ijm}$ is given by
$$\frac{\partial E}{\partial q_{ijm}} = \frac{\partial E}{\partial O_{m}} \cdot \frac{\partial O_{m}}{\partial V_{m}} \cdot \frac{\partial V_{m}}{\partial Z_{jm}} \cdot \frac{\partial Z_{jm}}{\partial Y_{ijm}} \cdot \frac{\partial Y_{ijm}}{\partial q_{ijm}}. \tag{13}$$
According to (6)–(9), these equations can be simplified as
$$\frac{\partial E}{\partial w_{ijm}} = -\frac{T_{m}}{O_{m}}\, k f'\!\left(k\left(V_{m} - \theta_{s}\right)\right) u_{jm} \Bigg( \prod_{i' \neq i} Y_{i'jm} \Bigg) f'\!\left(d_{ijm}\left(w_{ijm} x_{i} - q_{ijm}\right)\right) d_{ijm} x_{i}, \tag{14}$$
$$\frac{\partial E}{\partial q_{ijm}} = \frac{T_{m}}{O_{m}}\, k f'\!\left(k\left(V_{m} - \theta_{s}\right)\right) u_{jm} \Bigg( \prod_{i' \neq i} Y_{i'jm} \Bigg) f'\!\left(d_{ijm}\left(w_{ijm} x_{i} - q_{ijm}\right)\right) d_{ijm}. \tag{15}$$
In Equations (14) and (15), $f'(\cdot)$ represents the derivative of the activation function, which is the sigmoid function. These gradients facilitate the weight and bias updates required for the optimization process.
The learned weights and biases feed back into the dendritic neuron model through the updated Equation (9). The summation in Equation (10) ensures that the local motion features extracted at different receptive fields contribute jointly to the final motion direction detection. This process is analogous to biological motion processing mechanisms, where multiple receptive fields integrate local motion information to infer global motion direction.
The model undergoes iterative training with a gradient-based optimization algorithm, continuously refining the weight and bias matrices. Through this learning process, the model effectively captures motion direction information and adapts to different motion directions. The overall framework is illustrated in Figure 4, demonstrating the hierarchical processing structure for motion direction detection tasks. The model processes temporal image sequences, extracting spatial features from receptive fields at different times ($t$ and $t+1$). The structure consists of the outer and inner nuclear layers, followed by the ganglion cell layer, which integrates local motion signals. Computations in the synapse and branch layers contribute to direction selectivity, leading to a final motion direction detection. The learning process is optimized using backpropagation to refine the weight parameters, as shown by the dashed arrow.
3. Experiment Results
Our experiments were implemented in Python 3.9.16 with PyTorch 2.0.0. The simulations were run on a computer with an NVIDIA GeForce RTX 3090 GPU.
3.1. Dataset Statement
In this study, we utilized a custom dataset designed to evaluate the effectiveness of different neural network models, especially our proposed dendritic neuron model, in detecting motion direction across various scenarios. The dataset requires the models to handle multi-channel inputs and perform motion detection under various conditions. It comprises multi-channel (32 × 32 × 3) images that simulate object motion under diverse environmental conditions. These variations are introduced by modifying the RGB channel values of each pixel to represent different lighting and background scenarios. Additionally, the dataset includes objects of varying sizes: 1, 2, 4, 8, 16, 32, 64, and 128 pixels, with 20,000 images per size. The object shapes are completely random but remain connected, ensuring continuous structures within the images. The dataset is categorized into eight groups based on the background and object settings, which are defined as follows: ‘random’ (every pixel is given a newly generated random value from 1 to 255); ‘constant’ (all pixels are given the same randomly generated value from 1 to 255); and ‘dark’ (all pixels are given a value of 0).
Figure 5 shows sample images from the dataset. Each row in Figure 5 presents an example before and after motion. The red bounding box highlights the object’s edges before motion, while the red dashed box in the “After Motion” images indicates the previous position for comparison. The motion direction labels and the conditions of the images are shown at the bottom.
To further assess model robustness, we introduced additional noise variations into the dataset. Specifically, salt-and-pepper noise was applied to 1%, 2%, 5%, and 10% of the total image pixels, with the salt (white pixel) and pepper (black pixel) ratio randomized. Additionally, Gaussian noise was incorporated with a mean of 0 and standard deviations of 1, 5, 10, and 20, spanning low to high noise levels across the experimental conditions. These corruptions allow us to evaluate the models’ performance in noisy environments and demonstrate their robustness in motion detection tasks.
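For reference, the two corruptions can be reproduced with a short NumPy sketch; the RNG seeding and clipping behavior are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_salt_pepper(img: np.ndarray, proportion: float) -> np.ndarray:
    """Corrupt `proportion` of the pixels of an (H, W, 3) image with
    white (salt) or black (pepper) values; the split is randomized."""
    out = img.copy()
    h, w = img.shape[:2]
    n = int(proportion * h * w)
    ys, xs = rng.integers(0, h, n), rng.integers(0, w, n)
    out[ys, xs] = np.where(rng.random((n, 1)) < 0.5, 0, 255)
    return out

def add_gaussian(img: np.ndarray, std: float) -> np.ndarray:
    """Add zero-mean Gaussian noise with the given standard deviation."""
    noisy = img.astype(np.float32) + rng.normal(0.0, std, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```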
This simulated dataset supports testing model robustness for real-world applications, and its diverse conditions make it a suitable testing framework for bio-inspired image processing systems.
3.2. Bio-Mimetical Experiment on Branch Number
The number of branches directly influences the computational efficiency of the dendritic neuron model by affecting the computational resource demands. However, since our model employs a multiplication-based nonlinear algorithm in the synapse layer, the computational cost increases significantly with the initial number of dendrites, even though the algorithm itself remains relatively simple.
To evaluate the impact of the initial dendrite number, we conducted a series of comparison experiments, assessing how the model’s performance relates to the number of dendrites. Previous experiments with a bio-inspired model without horizontal cells indicated that the initial dendrite count had a negligible effect on performance. However, as our current model incorporates horizontal cells, this observation may not necessarily hold.
The evaluation considered multiple metrics, including the accuracy and the Compute-Normalized Learning Efficiency (CNLE), denoted as $L$. The CNLE provides a fair metric for comparing models that differ in computational resource consumption and is defined as
$$L = \alpha \cdot \frac{B \cdot \Delta E}{R \cdot P},$$
where $B$ represents the batch size used in training (set to 128 for standardization). $\Delta E$ is the improvement in performance over iterations, defined as the reduction in the cross-entropy loss $E$. $\alpha$ is a coefficient used to ensure that the value of $L$ remains within a suitable range (held fixed across all experiments here). $R$ denotes the total training time, measured as the duration required to reach a successful training outcome (in seconds). $P$ represents the GPU utilization percentage, accounting for the computational resources occupied by the model.
This metric ensures fairness when comparing models with different batch sizes and memory requirements, making it particularly suitable for resource-constrained environments. A model with a higher CNLE achieves comparable learning outcomes while consuming fewer computational resources. For the definition of a “successful training” standard in this paper, we first determined the convergence accuracy through extended training. Each model was trained 10 times for a suitable number of epochs, and the final accuracy was averaged. Training was considered successful when the accuracy stabilized within 0.05% of the long-term value, with a standard deviation below 0.1% over 10 epochs, indicating that the model had approached a local minimum. To ensure the training success rate of the model, we selected the maximum epoch count needed across the 10 trials, and the runtime required to reach this epoch was used for the CNLE computation. This approach applies throughout the remainder of this paper.
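Under the definition above, the CNLE is a one-line computation; the sketch fixes the value of alpha purely for illustration:

```python
def cnle(batch_size: int, delta_e: float, runtime_s: float,
         gpu_util: float, alpha: float = 1.0) -> float:
    """Compute-Normalized Learning Efficiency L = alpha * B * dE / (R * P):
    loss reduction per unit of training time and GPU occupancy,
    normalized by the training batch size (B = 128 in this paper)."""
    return alpha * batch_size * delta_e / (runtime_s * gpu_util)
```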
In this paper, we tested the performance of 1, 5, 10, and 20 initial dendrites. We randomly selected 1250 images from each of the eight datasets, totaling 10,000 images. Among them, 7500 images were used for training and 2500 for testing. The model performance for different initial dendrite branch numbers is shown in Table 1. To distinguish it from our previous model, the AVS, we use HCdM (horizontal-cell-based dendritic neuron model) to denote the proposed model throughout Section 3. In this section, all numerical values in the tables follow the same format. The accuracy values are expressed as means with four significant digits, accompanied by their standard deviations rounded to three significant digits. The CNLE values are reported with three digits after the decimal point to ensure comparability across models and configurations.
The experiment shows that while increasing the number of dendritic branches has little influence on the accuracy, it does affect the computational efficiency. The CNLE values increased from one to five branches, indicating improved efficiency, but dropped dramatically as the branch number increased further. This suggests that a moderate number of branches balances accuracy and efficiency, whereas excessive branches introduce computational overhead without substantial accuracy gains. Additionally, the HCdM with a higher branch number showed more stable convergence. Since the CNLE values for training and testing remained close, evaluating the training CNLE alone is sufficient when applying the HCdM.
To evaluate the model performance comprehensively, we conducted experiments across all eight datasets. Cross-validation among the datasets was further applied to assess the model’s robustness and CNLE across different initial dendrite numbers. The cross-validation results are shown in Table 2.
When both the background and object colors were assigned ‘random’, the number of motion features in a single image was significantly higher. This allowed the model to detect motion patterns more quickly, but at the cost of reduced generalization, leading to lower cross-accuracy. Conversely, when the features were limited, such as in the “Constant–Constant” setting, the model detected motion less efficiently but captured more detail, resulting in higher cross-accuracy. Additionally, as the branch number increased, the cross-accuracy did not always improve consistently. In some cases, such as the “Random–Random” condition, the cross-accuracy first decreased and then rebounded slightly at 20 branches. This suggests that while increasing dendritic branches can enhance feature extraction within a dataset, excessive branches may lead to overfitting, limiting the model’s ability to generalize across datasets. Furthermore, the CNLE values generally decreased with more branches, even though convergence became more stable. Hence, the stability that additional branches bring cannot be ignored.
Additionally, to evaluate the robustness of models with different branch numbers, we conducted noise resistance tests. These tests involved adding salt-and-pepper noise and Gaussian noise to our dataset and assessing the model’s performance under these conditions. The noise immunity results are shown in Table 3.
The experimental results indicate that the number of branches has little impact on model robustness. After undergoing synaptic-learning-based pruning in the HCdM, the dendritic structures exhibited similar patterns, leading to similar performance across different branch numbers. Notably, the model demonstrated strong robustness, particularly against salt-and-pepper noise. Even though salt-and-pepper noise above 5% already covered more pixels than more than half of the object sizes in our dataset, the model still maintained high accuracy. Additionally, when dealing with Gaussian noise, the HCdM maintained high accuracy as long as the standard deviation remained around or below the human color discrimination threshold, a difference of approximately 3 in RGB values. However, when the noise level significantly exceeded this threshold, its impact on the model’s performance became more pronounced.
3.3. Comparison Models
For the task of detecting motion direction in consecutive frames, it is essential to select models that can efficiently process multi-channel, sequential image data while capturing both spatial and temporal relationships. The state-of-the-art (SOTA) models chosen for comparison here include two kinds of tailored ViTs—a dual-stream ViT and a spatiotemporal ViT—along with three CNN models: EfficientNet, ResNet, and ConvNeXt. Below is a detailed explanation of each model and its role in the context of detecting motion direction.
The dual-stream vision transformer (2sViT) processes frames through two parallel streams with shared ViT weights, where spatial features are extracted independently from each frame for motion detection. While parameter sharing reduces computational overhead on small datasets, this approach needs more resources than CNNs and exhibits limited performance without data augmentation. The 2sViT excels at isolating the spatial features of each frame and then learning the temporal dynamics by combining these features. This separation helps the model focus on the specific characteristics of each frame while also understanding how objects move between them.
The spatiotemporal vision transformer (stViT) combines spatial and temporal information in a unified pipeline by processing both frames as a single input. Unlike dual-stream approaches, the stViT encodes motion dynamics through self-attention, capturing more motion detection features across frames. To further enhance temporal modeling, we employed long short-term memory (LSTM) between frame representations for spatiotemporal features. This modification improves motion direction detection by better preserving sequential dependencies compared to traditional stViT architectures. This model is particularly useful for tasks where motion needs to be understood in the context of how it evolves over time.
EfficientNet-B0 (EfNB0) is a CNN that optimizes model depth, width, and resolution through compound scaling, achieving high computational efficiency without sacrificing accuracy. It is especially effective for large-batch processing, which suits motion detection tasks: it can quickly process large volumes of data, making it an effective model for detecting movement in consecutive frames while managing computational costs. EfficientNet employs depthwise separable convolutions, which reduce the number of parameters and the computational complexity while maintaining accuracy.
ResNet50 is a well-established CNN architecture that utilizes residual connections to alleviate the vanishing gradient problem, enabling the training of deep networks. These connections help the network learn complex features without performance degradation as depth increases. ResNet50 was chosen for motion direction detection because, in preliminary tests, it offered the best balance between resource consumption and accuracy among the ResNet models. Its depth allows the model to extract detailed features from both individual frames and their differences, a capability essential for capturing the spatial and temporal patterns critical for detecting motion between consecutive frames.
ConvNeXt is a modern convolutional model that combines the strengths of traditional convolutional neural networks with the advancements of transformer-based architectures. The ConvNeXt model is well suited for processing large volumes of multi-channel data, such as consecutive image frames. Its ability to efficiently process both spatial and temporal features makes it effective for motion detection, as it can analyze both overall movement and local changes within frames. Based on preliminary tests, we selected the ConvNeXt-Tiny model for its optimal balance between efficiency and performance.
The AVS is a dendritic-neuron-based model specialized for motion direction detection and designed by the authors in previous work. It mimics biological visual pathways, including retinal structures and LGN cells, and serves as the basis from which the HCdM and its synaptic learning were developed. This structure enhances the AVS’s ability to process motion direction efficiently while maintaining biological plausibility.
Each model has distinct advantages depending on the computational constraints and the complexity of the motion patterns to be detected. Combining these models with a dataset designed to simulate object motion under various conditions allows for robust testing and comparison of their ability to handle multi-modal image sequences effectively. To ensure fairness and reproducibility across all models, we followed the same training schedule as in Section 3.2. Specifically, each model was trained on 7500 images and tested on 2500 images randomly chosen from the dataset. The simulation scoring was based on accuracy and CNLE to enable direct performance comparison. To ensure fairness in model learning, we used the same batch size of 128 and learning rate of 0.001 for all models. This batch size was selected to accommodate the large-scale models such as the ViTs and ConvNeXt. While our HCdM is capable of training with much larger batch sizes due to its lower computational demand, we maintained this level to preserve experimental consistency. To further reduce the overfitting risk, we applied L2 regularization to ConvNeXt, particularly in simulations where the model failed to converge stably. This adjustment highlights the structural limitations of ConvNeXt in handling the targeted motion detection tasks. As the CNLE-based efficiency was analyzed in detail in Section 3.2, where the HCdM’s efficiency was shown to vary with branch count while maintaining high accuracy, readers interested in the detailed performance tradeoffs of the HCdM are referred to that section. Similarly, the cross-validation results for the AVS model have already been published in previous work using the same evaluation methodology [44].
Since CNNs can struggle with multi-modal inputs such as two frames, we used image concatenation. Although optical flow and frame-difference methods capture some motion features, their detection performance in preliminary experiments was still inferior to that achieved with concatenated images. In particular, we evaluated both sparse and dense optical flow approaches, including the Kanade–Lucas–Tomasi tracking method and the Horn–Schunck method, and fed the resulting flow maps into the CNN-based models for comparison. Although such preprocessing is theoretically capable of capturing motion, the experimental results were worse than with image concatenation in most cases. This suggests that these optical flow methods fail to provide reliable features in this specific motion direction detection scenario. Image concatenation effectively captures the multi-modal information, thus improving the accuracy of motion detection. The training results of the SOTA models are shown in Table 4.
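The concatenation step itself is straightforward; a minimal sketch follows, with tensor shapes assumed from the dataset description:

```python
import torch

def concat_frames(frame_t: torch.Tensor, frame_t1: torch.Tensor) -> torch.Tensor:
    """Stack two (3, 32, 32) RGB frames along the channel axis so that a
    single-image CNN receives one (6, 32, 32) input holding both frames."""
    return torch.cat([frame_t, frame_t1], dim=0)
```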
The experimental results highlight the limitations of existing SOTA models in motion direction detection. While all models achieved near-perfect training accuracy, their test accuracy dropped significantly, indicating overfitting. This was particularly evident in the transformer-based models (2sViT and stViT), even though they were structurally tailored to this scenario. The CNNs achieved higher accuracy than the ViT models, but their CNLE values indicate that their decision confidence was not consistently reliable. Compared to these models, the HCdM overcomes these limitations by utilizing dendritic-neuron-based processing to capture motion direction features both locally and globally. Unlike SOTA models that overfit or fail to capture spatiotemporal features, the HCdM is designed to process consecutive frame images, ensuring high accuracy and stable performance across training and test datasets. This highlights its superiority in motion direction detection over present models. Since the SOTA models had much lower CNLE values, a CNLE comparison with the HCdM is uninformative, so the subsequent SOTA comparison experiments omit this metric.
To assess the robustness and performance of our model, we conducted several key experiments, including cross-validation and noise immunity tests. Each experiment was designed to evaluate a specific aspect of model performance, ensuring a comprehensive understanding of how well the model generalizes and performs in various scenarios.
3.3.1. Cross-Validation
Cross-validation serves as a crucial technique for evaluating the generalization ability of our model across different visual settings. In these experiments, each neural network was trained on one of the eight datasets with a particular color setting and then tested on the mixed dataset. This methodology allows an in-depth analysis of how the model performs across object–background combinations, such as light, dark, or random contrasts. By rotating the training and testing sets, we ensured that the model’s robustness and adaptability to unseen visual conditions were thoroughly tested. This simulates scenarios in which inputs vary widely while training data are confined to a single setting, and it highlights the biological realism of our model by reflecting the diverse nature of visual inputs. The experimental results are shown in Table 5.
The specialized ViT models, particularly the stViT, performed well in simple settings where both the background and object colors were uniform, achieving high test accuracy. However, when more complex scenarios involving random colors were introduced, their accuracy decreased sharply. In some cases, such as when both the background and object colors were random, the stViT exhibited almost no learning ability, with accuracy close to random guessing. This suggests that while the stViT can learn well under constrained conditions, it lacks the robustness to handle diverse real-world scenarios. The 2sViT showed more stable performance across different settings but maintained low accuracy, and its overall effectiveness in motion direction detection remained limited. Although ViT models show strong performance on large-scale natural image datasets, their accuracy dropped dramatically under dark background conditions in our experiments. This phenomenon can be attributed to the global attention mechanism among the patches. This mechanism relies heavily on contrast across the entire image and produces attention maps that focus narrowly on the patches around the object’s location, making it difficult to recognize the spatial–temporal displacement between frames that contributes motion direction features in dark background images. Additionally, the patch-based input structure of the ViT weakens localized motion information when visual contrast is low, limiting its ability to capture pixel-level features.
Among the CNN-based models, EfficientNet-B0 demonstrated the strongest performance, maintaining relatively high accuracy under controlled conditions, though it was sensitive to environmental randomness. The other CNN models, such as ResNet50 and ConvNeXt, exhibited poor generalization, with their cross-validation accuracy results remaining notably low.
In contrast, the HCdM demonstrated significantly stronger generalization ability. Unlike the SOTA models, the HCdM effectively captured motion direction features across varying conditions. This highlights its superior ability to extract relevant spatiotemporal patterns, making it a more robust and practical solution for real-world motion direction detection tasks.
3.3.2. Noise Immunity
The noise immunity test evaluates the model’s ability to handle noisy or corrupted inputs, which is essential for ensuring reliability in various scenarios. By introducing different levels of noise into the dataset, we tested model performance under these conditions and assessed each model’s ability to maintain accuracy and robustness despite the disturbances. Since data quality may vary in practice, a model must remain effective under such conditions to demonstrate its robustness. The noise immunity results for the SOTA models are shown in Table 6.
In this group of experiments, the results show that the SOTA models performed poorly under noisy conditions. This can be attributed to their complex structures’ focus on capturing global features, which neglects the local motion direction features. The EfNB0 showed strong robustness, as its accuracy remained much more stable than that of the other models as noise levels increased. However, its overall accuracy in this setting was already too low, making it unreliable for motion direction detection. The poor performance of the specialized ViT models (2sViT and stViT) was unexpected and unsatisfactory, since their patch-based processing should provide local features. It is worth noting, however, that the ViT-based models exhibited relatively strong robustness against Gaussian noise, which is considerable.
Compared to the traditional models, the AVS showed higher robustness against salt-and-pepper noise but struggled in high-noise-rate scenarios. Its performance dropped sharply as the noise level increased, indicating limited robustness. Furthermore, the AVS performed notably worse than the HCdM against Gaussian noise. A possible explanation is that its bipolar cells introduce unnecessary information that leads to misjudgments of motion direction.
In contrast, our HCdM, through its synaptic learning mechanism that prunes unnecessary connections, remained effective even in high-noise environments. This mechanism helps it detect motion direction without being overly sensitive to global disturbances. The results particularly highlight the HCdM’s superiority against salt-and-pepper noise, though it was relatively weak against Gaussian noise. Overall, the experimental results demonstrate the robustness of the HCdM.
3.4. Analysis
The experimental results show that the HCdM maintains robust performance across varying branch numbers and datasets. Specifically, its synaptic pruning, which is driven by the learning process, enables the model to build efficient and appropriate dendritic structures regardless of the initial complexity. Compared to SOTA models such as ViTs and CNNs, the HCdM offers significant merits.
While the ViTs struggled to capture localized features in low-contrast or noisy images, the CNNs tended to overfit and offered limited generalization, and the AVS performed poorly on specific noise types and lacks a clear biological basis, the bio-inspired structure of the HCdM ensured stable convergence, high accuracy, and low computational demand. Its localized receptive fields integrate motion direction features even in challenging environments, which helps it replicate biological recognition.
Even under severe noise conditions, the HCdM retained impressive accuracy, including for salt-and-pepper noise levels that exceeded the object size. These experimental results emphasize its robustness, resource efficiency, and adaptability across datasets.
These findings suggest that the HCdM not only provides biological plausibility but also achieves superior generalization and robustness, making it a convincing direction for bio-inspired motion direction detection models and even useful in real-world applications.
4. Discussion
The results of the cross-validation and noise immunity tests show that the HCdM demonstrates strong robustness. This suggests that its bio-inspired mechanisms, especially synaptic learning, enable the model to withstand salt-and-pepper noise and Gaussian noise, the noise conditions most commonly encountered in real-world environments. However, an interesting finding is that the HCdM exhibited a more significant drop in accuracy under Gaussian noise than under salt-and-pepper noise, which is not the case for human visual perception. A retinal mechanism that remains unclear in the biological literature is probably the reason. A future improvement could involve integrating algorithms from ViT-based structures or EfNs to introduce additional global filtering mechanisms and learning strategies, since they show comparable stability against this kind of noise.
Another important point is that, compared with another bio-inspired model, the AVS, the HCdM does not utilize the LGN structure, which is poorly explained in neuroscience research. Despite this, the HCdM successfully detects motion direction without relying on LGN-based processing, suggesting that the role of the LGN in motion detection might not be as critical as traditionally assumed. However, without direct biological evidence from experiments, this remains a theoretical hypothesis. Additionally, modeling more complex cortical interactions may further enhance the bio-inspired model’s ability to generalize across different visual conditions and to perform these tasks as precisely as human vision.
Another aspect worth exploring is the computational efficiency of the HCdM. The efficiency gains stem from the learned synaptic connections and the dendritic-neuron-based nonlinear algorithm. Furthermore, the number of branches is adjustable, allowing the stability level to be customized and making the HCdM adaptable to different tasks. Future research could explore hybrid structures in which the HCdM is combined with self-attention blocks or with more efficient feature extraction from SOTA models. Such improvements would make it possible to design a model that balances plausibility, robustness, and computational efficiency.
When compared with other studies, quantitative comparisons show that the HCdM achieved over 99.5% test accuracy and over 80% cross-accuracy, higher than the SOTA models, both ViTs (max 50.1%) and CNNs (max 61.7%), under the same experimental settings. Moreover, the HCdM retained accuracy above 60% even under 10% salt-and-pepper noise that compromised the integrity of the image, whereas the ViTs and CNNs dropped to around 20%. This confirms the HCdM’s robustness and generalization performance in quantitative terms.
In terms of general applicability, the HCdM is theoretically suitable for motion direction detection tasks, especially where pixel-level motion occurs, such as detecting micro-movements in medical imaging or slight displacements in mechanical part monitoring. Its ability to extract motion features with low-level computation and to process multi-channel data makes the HCdM generalizable. Since the HCdM is better suited to short-term motion detection tasks with specific directions, it is limited in tasks that rely on frame-to-frame displacement vectors compared with optical-flow-based methods. In terms of speed, it has low computational requirements and thus runs faster on limited hardware than ViTs or CNNs.
One limitation of the HCdM is that, although it reduces computational costs, its memory footprint may still grow if the branch number is excessively high, because each dendritic branch maintains independent parameters. Hence, a balance between stability and memory efficiency must be struck before full-scale experiments. Additionally, while the HCdM runs efficiently on small-to-medium datasets, it may require structural optimization to scale to large datasets such as ImageNet or to long video sequences. These limitations should be addressed when deploying the HCdM in large-scale systems. To enhance real-world adaptability, available options include hybrid modeling, such as combining the HCdM with CNN layers or ViT-like attention modules, and hardware better suited to dendritic architectures, such as neuromorphic chips. Moreover, dynamic pruning algorithms and reinforcement learning could be integrated to adaptively adjust the branch structure in response to input conditions.