Improving Network Training on Resource-Constrained Devices via Habituation Normalization

As a technique for accelerating and stabilizing training, batch normalization (BN) is widely used in deep learning. However, BN cannot effectively estimate the mean and the variance of samples when training or fine-tuning with small batches of data on resource-constrained devices, which leads to a decrease in the accuracy of the deep learning model. In the fruit fly olfactory system, a habituation mechanism based on a "negative image" model filters redundant information and improves numerical stability. Inspired by this circuit mechanism, we propose a novel normalization method, habituation normalization (HN). HN first subtracts the "negative image" obtained by habituation and then calculates the statistics for normalization, which solves the accuracy degradation of BN when the batch size is small. Experimental results show that HN can speed up neural network training and improve model accuracy on vanilla LeNet-5, VGG16, and ResNet-50 on the Fashion MNIST and CIFAR10 datasets. Compared with four standard normalization methods, HN maintains stable and high accuracy across different batch sizes, which shows that HN is strongly robust. Finally, applying HN to a deep learning-based EEG signal application system indicates that HN is suitable for network fine-tuning and neural network applications under limited computing power and memory.


Introduction
At present, many neural network-based applications are embedded in portable devices to monitor IoT systems in real time. However, most portable devices are resource-constrained, with limited power, computing capability, and memory. Training or fine-tuning neural networks on resource-constrained devices often requires training parameters different from those of the original networks, which may reduce accuracy and affect the final application. For example, when fine-tuning on embedded application systems, a smaller batch size often significantly reduces the accuracy of the neural network. Through analysis, we find that this accuracy drop of fine-tuned neural networks is related to the sensitivity of normalization to the batch size.
Normalization improves the training efficiency and generalization ability of neural network models. It has therefore been an influential component and an active research topic in deep learning, promoting progress in fields such as computer vision and machine learning. Among normalization methods, batch normalization (BN) [1] normalizes activations by calculating the mean and variance within a batch of data before the activation function. BN helps to stabilize the distribution of internal activations during model training, and numerous experiments show that it can effectively improve the learning efficiency and accuracy of deep networks [2]. BN underlies many state-of-the-art computer vision algorithms and is applied in the latest network architectures. Despite its great success, BN is not without drawbacks. For example, for a ResNet-50 model trained on CIFAR10, BN performs well with a sufficiently large batch size (e.g., 32 images per worker), but a small batch size leads to an inaccurate estimation of the mean and variance within a batch and thus to reduced model accuracy (Figure 1). In addition, BN cannot be effectively applied to recurrent neural networks (RNNs) [3]. In response, several other normalization methods have been proposed, such as Layer Normalization (LN) [3], Weight Normalization (WN) [4], Instance Normalization (IN) [5], Group Normalization (GN) [6], and Attentive Graph Normalization (AGN) [7]. Among these, GN has higher stability but lower performance at medium and large batch sizes. As a special case of BN and LN, IN considers only the elements of one channel of one sample when calculating statistics, which makes it most suitable for fast stylization. LN is mainly applied to RNNs and is rarely used in CNNs. It is therefore necessary to explore a new normalization method that is highly stable and suitable for different network types.
Habituation is a type of non-associative plasticity in which neural responses to repeated neutral stimuli are suppressed over time [8]. Habituation has been applied in robotics [9,10] and in deep learning networks to enhance object recognition [11]. Habituated models have also been applied to information filtering, pattern classification, and anomaly detection to improve detection accuracy [12]. These studies reveal the benefits of using habituation in machine learning and suggest that models incorporating additional features of habituation could yield more robust algorithms. In this paper, we propose a habituation normalization method (HN) based on the habituation "negative image" model: it computes an inhibitory image of the input data and then normalizes the input after subtracting that inhibitory image. HN uses batches of data to construct the inhibitory image and thereby achieves a batch-size-independent normalization method. It can also effectively eliminate noise or confusion in the statistical calculation. For example, when training ResNet-50 on CIFAR10 with a batch size of 4, BN achieves an average test accuracy over the last five epochs of 56.58%, while HN achieves 72.54%, a notable improvement.
The main contributions of this paper are:

1. We propose a new normalization method, Habituation Normalization (HN), based on the idea of habituation. HN accelerates the convergence and improves the accuracy of networks over a wide range of batch sizes.

2. HN helps maintain the model's stability. It avoids accuracy degradation when the batch size is small and performance saturation when the batch size is large.

3. The application of HN to a deep learning-based EEG signal application system shows that HN is suitable for deep neural networks running on resource-constrained devices.
In the remainder of this paper, we first introduce the works related to normalization and habituation in Section 2. Then the formulation and implementation are discussed in Section 3. In Section 4, the experimental analyses of HN are performed. Section 5 is a case study. Section 6 concludes the paper.
Related Work

Normalization Methods
Ioffe and Szegedy proposed batch normalization (BN) [1] in 2015. BN first normalizes a feature map with the mean and variance calculated along the batch, height, and width dimensions of the feature map, and then re-scales and re-shifts the normalized feature map. It is widely used in CNNs with significant results [14,15] but is less applicable to RNN and LSTM networks. In addition, BN leads to deteriorated network accuracy when the batch size is small.
In 2016, Ba, Kiros, and Hinton proposed layer normalization (LN) [3]. LN computes the mean and variance along the channel, height, and width dimensions of a feature map and then normalizes it. LN and BN are orthogonal in the dimensions over which they compute the mean and variance. LN performs the same operation during training and testing. It solves the problem that BN is unsuited for RNNs and, at the same time, achieves good results with small batch sizes. However, LN is still less accurate than BN in many large image recognition tasks.
Salimans and Kingma proposed the weight normalization (WN) [4] in 2016. WN decouples the weight vector into a parameter vector v and a parameter scalar g to reparametrize and optimize these parameters by stochastic gradient descent. Unlike BN and LN, WN has a special idea of parameter normalization. WN also accelerates the convergence of stochastic gradient descent optimization.
In 2016, Ulyanov, Vedaldi, and Lempitsky proposed instance normalization (IN) [5]. IN calculates the mean and variance over all elements of a single channel of a single sample and then normalizes. IN is mainly applied in style transfer to accelerate model convergence and maintain the independence between image instances.
In 2017, Ioffe proposed batch renormalization (BRN) by adding two non-trainable parameters, r and d, to BN [13]. BRN keeps the training phase and the inference phase equivalent, and mitigates the problems of non-i.i.d. data and small batches. Although BRN alleviates BN's accuracy reduction at small batch sizes, it is still batch dependent, so its accuracy is still affected by the batch size.
Wu and He proposed group normalization (GN) [6] in 2018. GN divides the channels into groups and calculates the mean and the variance over the channel, height, and width dimensions within each group. LN, IN, and GN all perform their computations independently of the batch axis. The two extreme cases of GN are equivalent to LN and IN. Although GN is batch size independent, it requires dividing the channels into G groups, so its stability lies between those of IN and LN.
Chen et al. proposed the attentive graph normalization (AGN) [7] in 2022. AGN learns a weighted combination of multiple graph-aware normalization methods, aiming to automatically select the optimal combination of multiple normalization methods for a specific task. However, it is limited to graph-based applications.

Biological Habituation and Applications
Habituation [8,9] is a form of simple physical memory. Over time, habituation inhibits neural responses to repetitive, neutral stimuli; that is, behavioral responses decrease when stimuli are perceived repeatedly. Habituation is also considered a fundamental mechanism of adaptive behavior, present in animals ranging from the sea slug Aplysia [16,17] through toads [19,20] and cats [21] to humans [18]. This adaptive mechanism allows organisms to focus their attention on the most salient signals in the environment, even when these signals are mixed with high background noise. Some researchers [9,22] investigated the mechanism of short-term habituation in the fruit fly olfactory circuit and tried to reveal how habituation in early sensory processing affects downstream odor encoding and odor recognition. For example, a dog sitting in a garden and habituated to the smell of flowers is likely to detect the appearance of a coyote in the distance, even though the coyote's odor is only a tiny fraction of the odor entering the dog's nose (Figure 2). Figure 2. When a dog sits in the garden and gets used to the smell of flowers, it can perceive any change in the environment (for example, a coyote that appears in the distance) [9]. As the dog habituates, the smell of flowers gradually fades away, and newly arriving smells are magnified and easily detected.
The effect of habituation on background elimination has also attracted the attention of computer scientists. Computational methods that demonstrate the primary effect of habituation (i.e., background subtraction) have been used in robotics applications [9,10] and in deep learning networks to enhance object recognition [11]. In 2018, Kim et al. applied a background subtraction algorithm [11] to each video frame to find the region of interest (ROI), then performed CNN classification to assign the ROI to one of the predefined categories. In 2020, Shen et al. implemented an unsupervised neural algorithm for odor habituation in the fruit fly olfactory circuit [9] and published the work in PNAS. They used background elimination to distinguish between similar odors and to improve foreground detection; the method improves the detection of novel components in odor mixtures.
Studies in [8-11,22] revealed the benefits of using habituation in machine learning or deep learning and suggested that models incorporating additional features of habituation yield more robust algorithms.
In this paper, based on the observation that habituation can filter redundant information and stabilize values, we design a habituation normalization layer (HN) for neural networks. It enhances the training efficiency of the network and improves the model accuracy.

Method
In this section, we first review existing normalization methods and then propose the HN method with stimulus memorability.

The Theory of Existing Normalization
The existing normalization methods calculate statistics over certain dimensions of the batch data and then normalize. The objectives are to unify magnitudes, speed up gradient-descent optimization, avoid neuron saturation, reduce gradient vanishing, prevent small output values from being swallowed, and avoid numerical problems caused by large values. Take a CNN as an example. Let x be the input to an arbitrary normalization layer, represented as a 4-dimensional tensor [N, C, H, W], where N is the number of samples, C the number of channels, H the height, and W the width. Let x_nchw and x̂_nchw be pixel values before and after normalization. Assuming that µ and σ are a mean and a standard deviation, respectively, the values normalized by BN, LN, and IN can all be expressed as (1):

x̂_nchw = α · (x_nchw − µ) / √(σ² + ε) + β,    (1)

where α and β are a scale and a shift parameter, respectively, and ε is a tiny constant.
Equation (1) summarizes the normalizing calculation for BN, LN, and IN; the only difference among the three is the set of pixels used to estimate µ and σ, which can be expressed using (2) and (3):

µ_i = (1/|S_i|) Σ_{(n,c,h,w)∈S_i} x_nchw,    (2)

σ_i = √( (1/|S_i|) Σ_{(n,c,h,w)∈S_i} (x_nchw − µ_i)² ),    (3)

where i ∈ {bn, ln, in} distinguishes the methods, S_i is a set of pixels, and |S_i| is the size of S_i. BN counts all pixels on a single channel, which can be expressed as

S_bn = {(n′, c′, h′, w′) | c′ = c}.    (4)

LN counts all pixels on a single sample, which can be expressed as

S_ln = {(n′, c′, h′, w′) | n′ = n}.    (5)

IN counts all pixels of a single channel of a single sample, which can be expressed as

S_in = {(n′, c′, h′, w′) | n′ = n, c′ = c}.    (6)

For GN, G is the number of groups, a predefined hyperparameter (G = 32 by default). In GN, the tensor is reshaped to [N, G, C/G, H, W]. Let x_ncghw and x̂_ncghw be pixel values before and after normalization; GN divides the channels into several groups and then counts all the pixels within each group of each sample. As can be seen from the above description, BN, LN, IN, and GN all depend on the data within a batch for calculating the mean and standard deviation; they do not consider the correlation between batches.
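The pixel sets above can be made concrete with a small NumPy sketch (illustrative only; the tensor shape and the G = 2 grouping are our own example values, and the variance is computed over the same axes as each mean):

```python
import numpy as np

# A toy activation tensor in [N, C, H, W] layout.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 5, 5))

# BN: statistics per channel, pooled over batch, height, and width.
mu_bn = x.mean(axis=(0, 2, 3), keepdims=True)   # shape [1, C, 1, 1]

# LN: statistics per sample, pooled over channel, height, and width.
mu_ln = x.mean(axis=(1, 2, 3), keepdims=True)   # shape [N, 1, 1, 1]

# IN: statistics per sample and per channel.
mu_in = x.mean(axis=(2, 3), keepdims=True)      # shape [N, C, 1, 1]

# GN: reshape channels into G groups, then pool within each group.
G = 2
xg = x.reshape(8, G, 4 // G, 5, 5)
mu_gn = xg.mean(axis=(2, 3, 4), keepdims=True)  # shape [N, G, 1, 1, 1]
```

The resulting statistic shapes make the dependence explicit: only BN pools over the batch axis, which is why its estimates degrade when N is small.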

Habituation Normalization
Habituation has three general features in biology:
• Stimulus adaptation: reduced responsiveness to neutral stimuli with no learned or innate value.
• Stimulus specificity: habituation to one stimulus does not reduce responsiveness to another stimulus.
• Reversibility: de-habituation to a context when it becomes relevant.
These features are closely relevant to computational problems, yet they have not been well applied.
Some researchers have established mathematical models of the habituation effects on the efficacy of a synapse, including Groves and Thompson [23], Stanley [24], and Wang and Hsu [25]. The model proposed by Wang and Hsu considers long-term memory, meaning that an animal habituates more quickly to a stimulus to which it has previously been habituated. Shen et al. [9] developed an unsupervised habituation algorithm (Figure 3). Inspired by these habituation-related studies, we propose a novel habituation normalization method (HN) applicable to deep neural networks. The "negative image" in HN is a weight vector v, which is initially a zero vector with shape [1, C, 1, 1]. At iteration t, the input x_nchw is adjusted with (7):

x̃_nchw = x_nchw − v,    (7)

where the weight vector v is updated with (8), broadcasting v so that its shape matches the shape of the input data:

v ← γ v + φ x,    (8)

Then the mean of v is calculated per channel, reducing the shape of v back to [1, C, 1, 1] to facilitate the following input:

v ← mean over n, h, w of v.    (9)
In the habituation method, the "negative images" are saved in vector v. If successive batches of data are similar, a "negative image" of the input forms as time goes on. After subtracting the "negative image" from the following input, what remains is the foreground component of the images. With HN, constructing the "negative image" is a gradual process that simultaneously considers the present batch and the influence of previous batches. Equation (9) removes the batch size factor after the "negative image" is constructed via (8); this ensures that HN is independent of the batch size, so it can be applied with different batch sizes.

Implementation
HN can be implemented in the popular neural network framework PyTorch. Figure 4 shows the code.
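As a complement to Figure 4, the following PyTorch module is a minimal sketch of Equations (7)-(9). The update v ← γv + φx followed by per-channel averaging, and the BN-style statistics on the adjusted input, reflect our reading of the text; the authors' implementation may differ in these details:

```python
import torch
import torch.nn as nn

class HabituationNorm2d(nn.Module):
    """Sketch of habituation normalization for [N, C, H, W] inputs."""

    def __init__(self, num_channels, gamma=0.5, phi=0.1, eps=1e-5):
        super().__init__()
        self.gamma, self.phi, self.eps = gamma, phi, eps
        # The "negative image": one value per channel, initially zero.
        self.register_buffer("v", torch.zeros(1, num_channels, 1, 1))
        # Learnable re-scale and re-shift, as in Equation (1).
        self.alpha = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x):
        # Equation (7): subtract the habituated "negative image".
        x_adj = x - self.v
        if self.training:
            with torch.no_grad():
                # Equation (8): habituation update, broadcast over the batch.
                v_full = self.gamma * self.v + self.phi * x
                # Equation (9): average over batch/height/width so v keeps
                # shape [1, C, 1, 1] and stays batch-size independent.
                self.v.copy_(v_full.mean(dim=(0, 2, 3), keepdim=True))
        # Normalize the adjusted input (per-channel statistics assumed).
        mu = x_adj.mean(dim=(0, 2, 3), keepdim=True)
        var = x_adj.var(dim=(0, 2, 3), keepdim=True, unbiased=False)
        return self.alpha * (x_adj - mu) / torch.sqrt(var + self.eps) + self.beta
```

Because the module takes only the channel count as a required argument, a layer like this can replace nn.BatchNorm2d(C) one-for-one in an existing network.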

Experiment
In this section, we evaluate the effectiveness of HN on two benchmark datasets and three different deep learning networks.
The two datasets are described in the following.
1. FASHION-MNIST [29]: FASHION-MNIST mirrors the format of the MNIST dataset: 60,000 training images with labels, 10,000 test images with labels, 10 categories, and a 28 × 28 resolution per image. The difference is that FASHION-MNIST contains not abstract symbols but concrete items of clothing, in 10 types.
2. CIFAR10 [30]: this dataset consists of 60,000 color images of 32 × 32 pixels, with 50,000 training images and 10,000 test images, divided into 10 categories.
In the experiments, all deep learning models use cross-entropy loss, sigmoid activation functions in the convolutional neural networks, and ReLU activation functions in the residual networks. BN, LN, GN, BRN (https://github.com/ludvb/batchrenorm), and the optimizer keep their default hyperparameters. For HN, we set γ = 0.5, φ = 0.1, t = 4 as the default settings. Following the idea of using simple networks by Ioffe and Szegedy [1], we build a vanilla convolutional neural network (Figure 5) according to the LeNet-5 structure proposed by LeCun [13]. LeNet-5 consists of 2 convolutional blocks and 2 fully connected blocks. Each convolutional block includes a convolutional layer, a sigmoid activation function, and a max pooling layer. Each convolutional layer uses a 5 × 5 kernel; the first has 6 output channels and the second has 16. In the two max pooling layers, we set the kernel size to 2 × 2 and the stride to 2. To pass the output of the convolutional blocks to the fully connected blocks, each sample in a mini-batch is flattened. The three fully connected layers have 120, 84, and 10 outputs. In the experiment, normalization layers are inserted before the sigmoid activation functions. We did not apply any data augmentation to the FASHION-MNIST and CIFAR10 datasets. Each model was trained using the Adam optimizer with a learning rate of 0.001.
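The vanilla network described above can be sketched as follows. The norm_layer hook is our own device for inserting a normalization layer before each sigmoid; the layer sizes follow the text:

```python
import torch.nn as nn

def make_lenet5(norm_layer=None):
    """Vanilla LeNet-5 for 1 x 28 x 28 inputs; pass e.g. nn.BatchNorm2d
    as norm_layer to insert a normalization before each sigmoid."""
    def conv_block(cin, cout):
        layers = [nn.Conv2d(cin, cout, kernel_size=5)]
        if norm_layer is not None:
            layers.append(norm_layer(cout))
        layers += [nn.Sigmoid(), nn.MaxPool2d(kernel_size=2, stride=2)]
        return layers

    return nn.Sequential(
        *conv_block(1, 6),           # 28x28 -> conv -> 24x24 -> pool -> 12x12
        *conv_block(6, 16),          # 12x12 -> conv -> 8x8 -> pool -> 4x4
        nn.Flatten(),                # each sample is flattened
        nn.Linear(16 * 4 * 4, 120), nn.Sigmoid(),
        nn.Linear(120, 84), nn.Sigmoid(),
        nn.Linear(84, 10),
    )
```

Calling make_lenet5() yields the vanilla model, while make_lenet5(nn.BatchNorm2d) yields the BN variant; an HN layer with the same single-argument signature would slot in the same way.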
The first experiment was conducted on the FASHION-MNIST dataset. When batch size = 2, the classification accuracy of BN is much lower than that of vanilla CNN, LN, and HN (Figure 6a), which once again verifies the limitation of BN (degradation when the batch size is small). In the same setting, the accuracies of HN and LN remain stable and insensitive to the batch size. HN converges faster than LN and quickly reaches the highest accuracy. As the number of epochs increases, the vanilla CNN overfits, which makes its test accuracy lower than those of BN and LN.
When batch size = 4, HN outperformed BN, vanilla LeNet-5, and LN in terms of convergence speed and accuracy at the beginning (Figure 6b). The test accuracies of vanilla LeNet-5, HN, LN, and BN become closer after epoch 12, and their final test accuracies differ very little.
When the batch size is 8 or 16, HN, LN, and BN converge faster than vanilla LeNet-5 (Figure 6c,d). BN slightly outperforms HN and LN in the first 5 epochs. As training proceeds, their test accuracies become essentially the same. Figure 6c,d show that both HN and BN can effectively improve the convergence speed of the network. From the above analysis, we find that BN still suffers accuracy degradation at small batch sizes on the FASHION-MNIST dataset, while our method HN adapts to a wide range of batch sizes and dramatically improves the convergence speed and accuracy of the vanilla network.
Then, we applied these methods to the color dataset CIFAR10. Compared with grayscale images, color images have more data features, so we additionally add GroupNorm (GN) and BatchRenorm (BRN) for comparison. Due to the simple network and limited number of channels, we set G = 2 for GN; BRN keeps its original settings. With 60 training epochs, the average accuracies of the last 5 epochs are shown in Table 1.

VGG16
Because the structure of LeNet-5 is relatively simple, this section additionally evaluates the popular deep convolutional neural network VGG16. We trained VGG16 without a normalization layer (vanilla) and VGG16 with BN or HN on the FASHION-MNIST dataset. As before, we optimized using Adam for 30 epochs, setting the initial learning rate to 0.001 and the batch sizes to 2, 4, 8, and 16. For each batch size, the accuracy-vs-epoch curves are shown in Figure 7, and the average accuracies of the last 5 epochs are shown in Table 2. The above analysis shows that, when the batch size is small, adding or omitting a normalization layer makes a huge difference for VGG16 on the FASHION-MNIST dataset. BN still suffers reduced accuracy at small batch sizes in deep convolutional neural networks. In contrast, the proposed HN remains adaptable to all batch sizes and to deep convolutional networks. Added to the original VGG16 network in place of BN, HN greatly improves the convergence speed and accuracy.

Comparisons on Residual Networks
We have analyzed the effectiveness of HN on vanilla LeNet-5 and VGG16. In this section, HN is applied to the popular ResNet-50 network to further validate its adaptability. He et al. proposed ResNet-50 in 2016; it has 16 residual blocks, each containing three convolutional layers of different sizes. When comparing the effectiveness of normalization methods, we do not use training techniques such as data augmentation or learning rate decay. The original data are read in for network training to ensure that the comparison of different normalization methods is not affected by preprocessing.
The baseline model is ResNet-50, which contains BN in its original design. The datasets used in this subsection are FASHION-MNIST and CIFAR10. In the baseline model, normalization is applied after the convolution and before the ReLU; to apply HN, we swap it into the model in place of BN. Adam is used as the optimizer with a learning rate of 0.001. The number of training epochs is 30, and the mini-batch sizes are 2, 4, 8, 16, and 32. In addition, we add GN and BRN for comparison. For GN, we use the recommended parameter settings in [6], where G = 32. For BRN, we keep the default settings of the source code.
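Swapping BN for another normalization in an existing model can be done by recursively replacing modules. The sketch below uses nn.GroupNorm as the stand-in and a toy model in place of ResNet-50 (an HN layer taking num_features as its only required argument would slot in identically, and the same swap applies to the real torchvision ResNet-50):

```python
import torch.nn as nn

def swap_batchnorm(module, make_norm):
    """Recursively replace every nn.BatchNorm2d in a model with a
    drop-in layer built by make_norm(num_features)."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, make_norm(child.num_features))
        else:
            swap_batchnorm(child, make_norm)

# Toy nested model standing in for ResNet-50.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU(),
    nn.Sequential(nn.Conv2d(8, 8, 3), nn.BatchNorm2d(8), nn.ReLU()),
)
swap_batchnorm(model, lambda c: nn.GroupNorm(2, c))
```

This keeps the rest of the architecture and its pretrained convolution weights untouched; only the normalization layers change.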
The experimental results on the FASHION-MNIST dataset are shown in Table 3. To reduce the effect of random variation, the average test accuracies of the last five epochs are listed. The results show that BN does not work well when batch size = 2; BRN converges, but has low test accuracy; HN and GN both achieve significant results at batch size = 2. At other batch size settings, their test accuracies are very close. For the CIFAR10 dataset, we use the default ResNet-50 settings. Table 4 shows the average test accuracies of the last 5 epochs. The results of GN (G = 32) are poor when batch size = 2, which may be caused by invalid statistics calculation, so GN (G = 2) is added for comparison as well. When batch size = 2, HN achieves the highest accuracy of 72.26%, which is 0.208 higher than GN (G = 2) and 0.5174 higher than BRN. When batch size = 4, the accuracy of HN is 0.1596 higher than BN and 0.0234 higher than GN (G = 32), but 0.0102 lower than BRN. At other batch size settings, their test accuracies are very close and stable. The ResNet-50 results on FASHION-MNIST and CIFAR10 show that BN and BRN are batch size dependent and GN is sensitive to the parameter G, while HN keeps stable and high accuracy over a wide range of batch sizes.

Memory Requirement Analysis
In this subsection, we examine the relationship between memory occupation and accuracy under vanilla, BN, and HN for the LeNet-5, VGG16, and ResNet-50 models (Figure 8). The estimated total memory sizes (MB) in Figure 8 correspond to the memory requirements of the models at batch sizes 2, 4, 8, and 16. The test accuracy is the average over the last 5 epochs. The estimated total memory sizes are obtained with the summary function of torchsummary in PyTorch. Due to space limitations, we present only the experimental results on FASHION-MNIST.
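For orientation, the parameter share of such an estimate can be computed directly. This is a rough sketch of our own; torchsummary additionally accounts for input and activation memory, which grow with the batch size and dominate the totals reported in Figure 8:

```python
import torch.nn as nn

def param_memory_mb(model):
    """Parameter memory in MB, assuming float32 (4 bytes per value)."""
    n_params = sum(p.numel() for p in model.parameters())
    return n_params * 4 / (1024 ** 2)

# Toy block: a 5x5 conv (156 params) plus a BatchNorm2d (12 params).
tiny = nn.Sequential(nn.Conv2d(1, 6, 5), nn.BatchNorm2d(6))
print(f"{param_memory_mb(tiny):.6f} MB")
```

Because parameter memory is batch-size independent, the growth of the totals with batch size in Figure 8 is attributable to activations, which is exactly where a normalization layer's extra buffers matter.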
The vanilla networks have the smallest memory requirements because they contain no normalization layer. The memory requirements of BN and HN are very close and increase with the batch size (Figure 8c). Across the three models, HN needs only 60.1%, 64.3%, and 57.8% of the memory required by BN while achieving comparable accuracy.
From the above analysis, we find that models with a small batch size consume less memory and are thus more amenable to training and deployment on resource-constrained devices. Compared with BN, HN achieves higher accuracy at small batch sizes, so it is more suitable for resource-constrained devices.

Case Study
A brain-computer interface (BCI) constructs a direct communication pathway between the human brain and external devices without passing through the muscular system. BCI technology is widely used in assisted rehabilitation and brain-sensing games. Due to their low cost and high resolution, electroencephalogram (EEG) signals are widely used in BCI applications. The EEG-BCI process includes EEG signal acquisition, signal processing, and pattern recognition. Given the advantages of end-to-end neural networks in pattern recognition, EEG-BCI systems are gradually leaving the laboratory and being applied in portable-device scenarios, such as embedded systems. As shown in Figure 9, the application of an embedded EEG-BCI system includes the following four steps:

1. Training: in the laboratory setting, collect enough EEG trials to train a deep neural network for pattern recognition.
2. Deploying: deploy the pre-trained deep neural network model to the embedded device.
3. Fine-tuning: fine-tune the deep neural network model while acquiring EEG trials.
4. Applying: apply the fine-tuned and stabilized deep neural network model to control the embedded device.

Figure 9. EEG signal application system based on deep learning and its application process.
A wet-electrode EEG-BCI system requires the subject to wear an EEG cap with conductive paste applied to each electrode, keeping the resistance of each electrode below 10 kΩ. However, subjects cannot guarantee that the electrode cap will be worn in the same position when migrating from the laboratory to the embedded device, and keeping the resistance of each electrode identical is even less feasible (it is only guaranteed to be <10 kΩ). Because of these situational differences, the EEG trials collected on the embedded device are not consistent with the EEG trials used to train the deep neural network and no longer satisfy the assumption of independent and identical distribution. Due to the non-linear and non-stationary characteristics of EEG signals, the pre-trained deep neural network model must be fine-tuned to adapt to the embedded-device situation. However, because of the storage and computational bottlenecks of embedded devices, we must use a limited batch size for fine-tuning.
In this case study, we use an EEG-based motor imagery BCI (MI-BCI) as an example to verify the effectiveness of HN when fine-tuning the deep neural network model. ShallowFBCSPNet (https://github.com/TNTLFreiburg/braindecode/blob/master/braindecode/models/shallow_fbcsp.py), proposed by Schirrmeister et al. in 2017, is a deep neural network designed for decoding imagined or executed tasks from raw EEG signals and performs well at classifying EEG signals [31]. The BCI Competition IV dataset 2a (BCICIV 2a) is a classical EEG-based MI-BCI dataset. We take this dataset as an example to analyze and compare the performance of HN and BN. Since BN is embedded in ShallowFBCSPNet originally, we replace BN with HN and set max_epoch to 1600 and max_increase_epochs to 160. No extra preprocessing is performed on the EEG signals.
First, we conducted experiments in two cases, the original batch size (batch size = 60) and a smaller batch size (batch size = 8), to examine the influence of the batch size on the test accuracy of HN. Table 5 shows the best prediction results in the 10 epochs before the end of training with batch size = 60, and Table 6 the corresponding results with batch size = 8. After replacing BN with HN, the test accuracy improved on 6 of the 9 subjects, with a maximum improvement of 0.274 (Table 5), and decreased slightly on the other three subjects, with a maximum reduction of 0.021. Overall, the average accuracy improved by 0.038, which indicates that HN is more suitable for MI recognition of EEG signals when batch size = 60.
When batch size = 8, Table 6 shows that the test accuracy improved on 8 of the 9 subjects, by up to 0.198, after replacing BN with HN. Overall, the average accuracy was enhanced by 0.05401. These results indicate that ShallowFBCSPNet with HN outperforms BN for MI recognition of EEG signals when the batch size is small. To simulate the real application scenario of the deep learning-based EEG signal application system, experiments were conducted with EEG signals from a subject A for training (batch size = 8), followed by fine-tuning (batch size = 2) and testing on a subject B as the user. We train the model with the EEG signals from subjects 2, 4, 6, 8, and 1, and fine-tune and test the model on subjects 1, 3, 5, 7, and 9 as users. During fine-tuning, we randomly selected 20% of the EEG signals from subject B. The best prediction result of the last 10 epochs of the fine-tuned model is shown in Table 7. Table 7. Accuracies obtained under different training and test users. Training model with EEG from subject i (batch size = 8), fine-tuning with 20% EEG from subject j (batch size = 2), and then testing on subject j. i→j represents training on subject i and testing on subject j.
∆ denotes the improvement of test accuracy with HN. As shown in Table 7, when the subject and the user are different, the model is fine-tuned with a smaller batch size. Although some fine-tuning data are available, the accuracy of ShallowFBCSPNet decreases greatly, which indicates that embedded-device applications of deep learning-based EEG signal recognition still have a long way to go. Comparing HN with BN, HN demonstrates better accuracy in five pairs of experiments, while BN shows no clear advantage. Overall, the average accuracy of HN is 4.4% higher than that of BN, which indicates that HN is more suitable for deep neural network recognition models on resource-constrained devices.

Conclusions
Habituation is a simple form of memory that changes over time and inhibits neural responses to repetitive, neutral stimuli. This adaptive mechanism allows organisms to focus their attention on the most salient signals in the environment. In the Drosophila olfactory system, habituation based on a "negative image" model filters redundant information and enhances olfactory stability. Inspired by this circuit mechanism, we propose a novel normalization method, habituation normalization (HN), with the three biological characteristics of habituation: stimulus adaptation, stimulus specificity, and reversibility. HN first subtracts the "negative image" obtained by habituation and then calculates the overall statistics to achieve normalization.
We apply HN to LeNet-5, VGG16, and ResNet-50. Experiments on two benchmark datasets show that HN can effectively accelerate network training and improve test accuracy. Comparison with other normalization methods (LN, BN, GN, and BRN) verifies that HN can be used over a wide range of batch sizes and shows good robustness. Finally, we apply HN to a deep learning-based EEG signal application system. Experimental results in two cases (train on A, test on A; train on A, fine-tune and test on B) show that HN is more suitable for deep learning applications on resource-constrained devices.
As future work, we will extend HN to other types of deep learning networks, such as recurrent neural networks (RNN/LSTM) and generative adversarial networks (GAN).