Noise Annoyance Prediction of Urban Substation Based on Transfer Learning and Convolutional Neural Network

The noise pollution caused by urban substations is an increasingly serious problem, and more local residents are being disturbed by substation noise. To accurately assess the degree of noise annoyance that substations cause to surrounding residents, we established a noise annoyance prediction model based on transfer learning and a convolutional neural network. The model takes the noise spectrum as the input and the subjective evaluation result as the target output, and it uses the AlexNet network, with a modified output layer and corresponding parameters, as the pre-trained model. With the learning rate and the number of epochs fixed, the influence of different mini-batch sizes on the prediction accuracy of the model was compared and analyzed. The results showed that with mini-batch sizes of 4, 8, 16, and 32, all the data sets converged after 90 epochs. The root mean square error (RMSE) of all validation sets was lower than 0.355, and the loss of all validation sets was lower than 0.067. As the mini-batch size increased, the RMSE, loss, and mean absolute error (MAE) of the validation set gradually increased, while the number of iterations and the training duration gradually decreased. In this test, a mini-batch size of four was appropriate. The resulting convolutional neural network model showed high accuracy and robustness, and the error between the prediction result and the subjective evaluation result was between 2% and 7%. The model comprehensively reflects the objective metrics affecting subjective perception and accurately describes how urban substation noise is subjectively perceived by the human ear.


Introduction
With the continued popularization of electric vehicles, more charging piles are being installed, creating a corresponding increase in the demand for power supply, which has led to the construction of more urban substations and other basic power grid facilities. However, siting new urban substations is increasingly problematic because of the associated noise pollution. With the acceleration of urbanization in China, residential areas have also grown up around existing substations, so more residents are being disturbed by the noise these substations produce. Local residents frequently complain of noise pollution, and the social problems caused by residents who are adversely affected by substation noise have become more pronounced. Therefore, accurately evaluating and predicting the degree to which substation noise disturbs residents is important both for the construction of new substations and for implementing noise control measures at existing ones. This is a common problem in the power grid industry that urgently needs to be solved.
Urban substation noise is usually composed of low-frequency noise caused by transformers and reactors and high-frequency noise caused by transformer cooling systems, although the former is dominant [1][2][3]. Low-frequency noise is characterized by strong penetration, slow attenuation, and long transmission distance, and it is easily perceived by local residents. In addition, the most typical low-frequency substation noise occurs at 100 Hz and its harmonic frequency components are prominent [4], which leads to a high level of subjective annoyance. When residents are exposed to substation noise for a long time, their physical and mental health is seriously affected [5][6][7][8]. Therefore, it is important to explore methods for evaluating noise annoyance and to establish a noise annoyance prediction model that conforms to the noise characteristics of urban substations.
Currently, the noise level of substation equipment is evaluated using the A-weighted sound pressure level (AWSPL) of a sound signal [9][10][11][12]. Chen et al. [10] established a noise annoyance prediction model for urban substations based on commonly used psychoacoustic metrics using the multiple linear regression (MLR) method and concluded that annoyance depends mainly on the AWSPL. Liu [11] established a substation noise annoyance prediction model based on logistic regression and stepwise regression analysis, concluding that, in addition to the AWSPL, loudness and loudness level also have a significant impact. Another study [12] predicted the noise annoyance of urban substations with a multiple linear regression model, concluding that pure tones at typical frequencies have a significant effect on annoyance [13]. The MLR model is simple and fast to apply, but its prediction accuracy needs to be improved to better reflect the subjective perception of the human ear [14].
In this paper, a convolutional neural network model is presented as a means of evaluating the degree to which urban substation noise disturbs residents. Compared to traditional sound quality prediction methods based on objective acoustic metrics, a convolutional neural network automatically extracts and learns features from images or signals without tedious manual feature extraction, and it has achieved satisfactory results in many applications [15]. In 2012, the AlexNet convolutional neural network was proposed and won first place in the ImageNet image classification competition with an error rate of only 15.3% [16]. In 2013, Wan et al. [17] proposed DropConnect, an upgraded version of the Dropout technique, which converts the fully connected layer into a sparsely connected layer to make the network more generalized and robust. In 2015, He et al. [18] built the ResNet network, whose error rate on the ImageNet dataset set a new record of only 4.49%, lower than the human error rate (5.1%).
In 2016, Yuan et al. [19] constructed a convolutional neural network model for rolling bearing fault diagnosis. The input data were grayscale time-frequency maps obtained by applying a continuous wavelet transform to the vibration signals of rolling bearings. The training results showed that the model has a strong recognition ability. In 2019, Liang et al. [20] established a sound quality prediction model based on a convolutional neural network after processing the input sound signals of internal combustion engines with the auditory spectrum and short-time average energy. The results showed that the model has a high prediction accuracy. In 2020, Huang et al. [21] used a self-constructed convolutional neural network to evaluate the sound quality of interior noise. They used two-dimensional noise time-frequency images and one-dimensional time and frequency vector data as the input and found that the network performed better with two-dimensional images. In 2021, the same team proposed a convolutional neural network model based on an adaptive learning rate tree (ALRT) to predict the interior sound quality of a pure electric vehicle (PEV) under non-steady-state conditions. The model adjusts the learning rate adaptively according to the training loss. It was shown that the ALRT convolutional neural network updates its parameters more effectively than traditional methods, and its future use is promising [22]. Based on transfer learning, we developed a convolutional neural network model for predicting the degree of noise annoyance from small data sets.
Compared to training a traditional convolutional neural network from scratch, the transfer learning method allows the model to achieve the expected effect in a shorter training time, which significantly reduces training costs such as time and computing resources and considerably improves the generalization ability of the model.
The remainder of this paper is structured as follows: Section 2 briefly introduces the basic principles of the convolutional neural network and the transfer learning methods. Section 3 develops a convolutional neural network model for the prediction of substation noise annoyance. This section also compares and analyzes the effects of different mini-batch sizes on the prediction accuracy of the model when the learning rate and the number of epochs remain unchanged. Section 4 discusses the advantages of the proposed model by comparing the prediction results with a well-established multiple linear regression model. Section 5 sets out our conclusions.

Materials and Methods
A convolutional neural network (CNN) is a feedforward neural network [23] whose basic structure consists mainly of convolutional layers, pooling layers, and fully connected layers. These layers are stacked hierarchically, as shown in Figure 1 [21]. A convolutional neural network usually takes multi-dimensional data as input, automatically extracts features from it, and performs recognition and classification. It is essentially a nonlinear mapping from input to output, and it learns this mapping automatically without requiring an explicit mathematical expression relating the two.

A traditional convolutional neural network requires tens of thousands of training samples, but obtaining sufficient labeled data is time- and labor-consuming. Transfer learning uses the principle of analogy to transfer the feature information learned from the source domain to the target domain [24]. The two domains are different but somewhat related; the higher the correlation, the easier it is to obtain ideal transfer results. By directly reusing a trained network and transferring it to the target data set, the model can achieve the desired effect in a relatively short training time, fundamentally addressing the problem of insufficient training data. At the same time, the software and hardware requirements for transfer learning are less demanding. A pre-trained model is a deep learning model that others have already trained on massive data, and using one is a form of transfer learning. Instead of training the network from scratch, a pre-trained model from a similar scenario is chosen to solve the problem, which greatly reduces the training time.
In this study, the classical AlexNet network was selected as the pre-trained model. Since the data set used to train the model was small and the data had relatively simple shallow features such as prominent spikes, the network architecture was completely preserved, save for the output layer. The last SoftMax layer and the classification output layer were replaced with a fully connected layer with a single response and a regression layer, thereby converting the classification network into a regression network. The AlexNet network architecture is shown in Figure 2 [22].
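The head replacement described above can be illustrated with a small, framework-free sketch. It is only a toy: layers are plain tuples rather than real network objects, the layer names (`fc6`, `fc7`, `fc8`, `prob`, `output`, `fc_out`) are assumptions that follow common AlexNet naming rather than anything stated in the paper, and no weights are involved.

```python
# Toy sketch of converting a classification tail into a regression tail.
# Layers are (name, kind, size) tuples; names are illustrative assumptions.

ALEXNET_TAIL = [
    ("fc6", "fully_connected", 4096),
    ("fc7", "fully_connected", 4096),
    ("fc8", "fully_connected", 1000),
    ("prob", "softmax", None),
    ("output", "classification", None),
]


def to_regression_head(layers):
    """Drop the softmax and classification layers, then append a
    single-response fully connected layer and a regression layer."""
    kept = [layer for layer in layers
            if layer[1] not in ("softmax", "classification")]
    return kept + [
        ("fc_out", "fully_connected", 1),   # one response: the annoyance score
        ("regression", "regression", None),
    ]


regression_tail = to_regression_head(ALEXNET_TAIL)
for name, kind, size in regression_tail:
    print(name, kind, size)
```

In a real deep learning framework the same idea applies to the loaded pre-trained network: the earlier layers keep their learned weights and only the new output layers start from fresh initial values.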

Results
The performance of the method used in this study was verified using the urban substation noise recordings and the corresponding subjective evaluation results of a previous study [12]. These urban substation noise samples were recorded using a PULSE data acquisition system (Bruel & Kjaer type 3050-A-060, Naerum, Denmark) and microphones (Bruel & Kjaer type 4189-A-021, Naerum, Denmark). Noise measurement points were arranged mainly near substation plant boundaries and residential areas to reflect the impact on residents and pedestrians. To make the data more diverse and to reflect the impact of noise on the staff in the substation, a number of measurement points were also arranged within the substation plant. A total of 17 measurement points were arranged, and three time periods were selected: 2:00 to 3:00, 9:00 to 10:00, and 20:00 to 22:00. Across the measurement points and time periods, a total of 51 noise samples with a duration of 5 s were collected. Of these, 43 samples were randomly selected for training the network model, and the remaining 8 samples were used to verify its reliability. Because the number of samples was small, each of the 43 training recordings was divided into five non-overlapping 1 s segments, giving a total of 215 noise segments. Because substation noise is steady-state, each segment was given the same subjective evaluation score as the recording from which it was cut. Three segments of each noise sample were assigned to the training set and the remaining two to the test set, so the ratio of the training set to the test set was 3:2. Each segment was converted into its frequency spectrum using MATLAB 2020b (MathWorks, USA). All the calculated frequency spectra were saved as color images of the same size.
The images were used as the input of the convolutional neural network (each image was set to 227 × 227 pixels, which is consistent with the input layer size of the AlexNet network). Some of the input spectra are shown in Figure 3. As can be seen in Figure 3, the noise energy of these samples was concentrated at integer multiples of 100 Hz below 1000 Hz, and pure tones were very prominent. The subjective evaluation score of each sound sample was used as the target output of the convolutional neural network.
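The sample counts in this data preparation can be checked with a short sketch. It assumes that the first three segments of each recording go to the training set and the last two to the test set; the text only gives the 3:2 ratio, so that ordering, the random seed, and all function and variable names here are illustrative choices.

```python
import random

N_RECORDINGS = 51            # 5 s recordings collected
N_MODELING = 43              # recordings used for modeling; the other 8 are held out
SEGMENTS_PER_RECORDING = 5   # non-overlapping 1 s segments per recording
TRAIN_SEGMENTS = 3           # segments per recording assigned to the training set


def split_segments(n_recordings, n_modeling, seed=0):
    """Return (train, test, verification) index lists.

    train and test hold (recording_id, segment_id) pairs;
    verification holds the ids of the held-out recordings.
    """
    rng = random.Random(seed)  # the seed is an arbitrary choice
    ids = list(range(n_recordings))
    rng.shuffle(ids)
    modeling, verification = ids[:n_modeling], ids[n_modeling:]

    train, test = [], []
    for rec in modeling:
        for seg in range(SEGMENTS_PER_RECORDING):
            # Assumption: the first three segments train, the last two test.
            (train if seg < TRAIN_SEGMENTS else test).append((rec, seg))
    return train, test, verification


train, test, verification = split_segments(N_RECORDINGS, N_MODELING)
print(len(train), len(test), len(verification))
```

Under these assumptions the 215 segments split into 129 training and 86 test segments, with 8 whole recordings kept aside for verification.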

In this study, MATLAB 2020b was run on a single Nvidia GeForce 940MX GPU, and the built-in stochastic gradient descent with momentum (SGDM) solver was used. When fine-tuning, the learning rate usually needs to be lower than that used for general model training; for this model, an initial learning rate of 0.0001 was appropriate. The mini-batch size is usually set to a power of two, which can make the GPU perform better [25]. In this study, the mini-batch size was set to 4, 8, 16, and 32 in separate experiments. Training was run for 90 epochs (one epoch is one complete pass of the training data set through the neural network).
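The relationship between the mini-batch size and the number of training iterations can be made concrete with a short calculation. This is a sketch under two assumptions not stated in the text: the training set holds 129 segments (43 recordings with three training segments each), and an incomplete final mini-batch in each epoch is discarded. The `drop_last` flag is a hypothetical parameter for the alternative convention.

```python
N_TRAIN_SEGMENTS = 129   # assumed: 43 recordings x 3 training segments
EPOCHS = 90
MINI_BATCH_SIZES = (4, 8, 16, 32)


def iterations(n_samples, batch_size, epochs, drop_last=True):
    """Return (iterations per epoch, total iterations).

    With drop_last=True an incomplete final mini-batch is discarded;
    otherwise it counts as one extra iteration per epoch.
    """
    if drop_last:
        per_epoch = n_samples // batch_size
    else:
        per_epoch = -(-n_samples // batch_size)  # ceiling division
    return per_epoch, per_epoch * epochs


for batch_size in MINI_BATCH_SIZES:
    per_epoch, total = iterations(N_TRAIN_SEGMENTS, batch_size, EPOCHS)
    print(f"mini-batch {batch_size:>2}: {per_epoch:>2} iterations/epoch, {total} in total")
```

Doubling the mini-batch size roughly halves the number of parameter updates over the same 90 epochs, which is consistent with the shorter training duration reported for larger mini-batches.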
In this study, the root mean square error (RMSE) and the loss were selected to evaluate the quality of the prediction results, where the loss is the square of the RMSE. The smaller the RMSE or the loss, the better the quality of the model. The RMSE was calculated as
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(t_k - y_k\right)^2} \tag{1}$$
where $n$ denotes the number of training samples, and $t_k$ and $y_k$ denote the target output and the forecast output of the $k$-th training sample, respectively.
Based on the aforementioned settings of the network model parameters, the spectrum images of urban substation noise were input into the model for training. Each experiment was repeated ten times to reduce the impact of randomness. The curves of the average RMSE and loss of the training set and the test set in each epoch, for the different mini-batch sizes, are shown in Figure 4 (averaged over the ten trials). As shown in Figure 4, as the number of iterations increased, the RMSE and loss of the training set and the test set generally decreased. After 90 epochs, all data sets had converged; the RMSE of all test sets was not higher than 0.355 and the loss was not higher than 0.067, which indicates that the convolutional neural network based on AlexNet transfer learning can achieve high accuracy when used to predict the noise annoyance of urban substations.
The errors between the subjective scores and the predicted values for the eight verification samples not involved in the modeling, for mini-batch sizes of 4, 8, 16, and 32, are shown in Figure 5. As shown in Figure 5, although the prediction errors for the verification samples varied with the mini-batch size, their trend was the same. When the mini-batch size was four, the prediction error was the smallest and the accuracy of the convolutional neural network model was the highest, so a mini-batch size of four was selected.
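The RMSE and loss used in this section can be computed as in the following minimal sketch. It follows the text in treating the loss as the square of the RMSE; training frameworks may scale their reported loss differently (for example by a constant factor), so reported RMSE and loss values need not satisfy this identity exactly. The sample values are hypothetical and are not data from the study.

```python
import math


def rmse(targets, predictions):
    """Root mean square error between target and predicted scores."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - y) ** 2 for t, y in zip(targets, predictions)]
    return math.sqrt(sum(squared_errors) / len(targets))


def loss_from_rmse(rmse_value):
    """Loss as described in the text: the square of the RMSE."""
    return rmse_value ** 2


# Hypothetical annoyance scores, for illustration only.
targets = [3.0, 5.0, 7.0]
predictions = [3.5, 4.5, 7.0]

error = rmse(targets, predictions)
print(f"RMSE = {error:.4f}, loss = {loss_from_rmse(error):.4f}")
```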

Discussion
The multiple linear regression model established in a prior study [12] was used to predict the annoyance level of the eight verification noise samples, and the results were compared with the subjective evaluation results. The multiple linear regression model is based on the least squares method; it takes the objective parameters that are highly correlated with the subjective evaluation results as independent variables and the subjective annoyance level as the dependent variable. In multiple linear stepwise regression, the independent variables are introduced into the regression model one by one according to the significance of their influence on the dependent variable, and previously introduced variables that are no longer significant once new variables are added are removed. Therefore, the final regression model includes only variables that have a significant impact on the dependent variable. The multiple linear regression model is shown in Formula (2):
$$P = -3.354 + 0.035\,L_{600} + 0.089\,L_{A} \tag{2}$$
where $P$, $L_{600}$, and $L_{A}$ denote subjective annoyance, the 600 Hz AWSPL, and the total AWSPL, respectively.
The comparison of the prediction errors of the two models is shown in Figure 6. As shown in Figure 6, the prediction errors of the multiple linear regression model for the noise annoyance of urban substations are between 1% and 9%, whereas those of the convolutional neural network model are between 2% and 7%. The latter model therefore gave a relatively ideal and robust result. The mean absolute error and the sum of squared errors of the prediction errors of the two models are shown in Table 1. The mean absolute error (MAE) and the sum of squared errors (SSE) are calculated, respectively, as
$$\mathrm{MAE} = \frac{1}{n}\sum_{k=1}^{n}\left|t_k - y_k\right| \tag{3}$$
$$\mathrm{SSE} = \sum_{k=1}^{n}\left(t_k - y_k\right)^2 \tag{4}$$
where $n$ denotes the number of verification noise samples, and $t_k$ and $y_k$ denote the target output and the forecast output of the $k$-th verification noise sample, respectively.
As shown in Table 1, the MAE and the SSE of the convolutional neural network model are significantly lower than those of the multiple linear regression model, indicating that the convolutional neural network model has higher prediction accuracy.
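The regression baseline and the two error measures used in this comparison can be written out as a short sketch. The coefficients are those of Formula (2); the MAE and SSE functions use the standard definitions of those quantities. The function names and all input values are hypothetical and chosen only for illustration.

```python
def mlr_annoyance(l_600, l_a):
    """Formula (2): annoyance from the 600 Hz AWSPL and the total AWSPL."""
    return -3.354 + 0.035 * l_600 + 0.089 * l_a


def mae(targets, predictions):
    """Mean absolute error over paired target and predicted scores."""
    return sum(abs(t - y) for t, y in zip(targets, predictions)) / len(targets)


def sse(targets, predictions):
    """Sum of squared errors over paired target and predicted scores."""
    return sum((t - y) ** 2 for t, y in zip(targets, predictions))


# Hypothetical levels, not measurements from the study.
predicted = mlr_annoyance(l_600=40.0, l_a=55.0)
print(f"predicted annoyance: {predicted:.3f}")

# Hypothetical subjective scores and model outputs.
subjective = [1.0, 2.0, 3.0]
model_out = [1.5, 2.0, 2.0]
print(f"MAE = {mae(subjective, model_out):.3f}, SSE = {sse(subjective, model_out):.3f}")
```

Because the SSE is not divided by the number of samples, it is only comparable between models evaluated on the same verification set, as is the case for the eight samples here.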

Conclusions
This paper presents a convolutional neural network model, based on AlexNet transfer learning, for predicting the noise annoyance level of urban substations. The frequency spectra of the noise samples were taken as the input for the model, and the sample features were automatically extracted during the training process. The subjective evaluation results were the output of the model. The number of epochs was fixed, and the mini-batch size was set to 4, 8, 16, and 32 in separate experiments. In every case, all the data sets converged. The root mean square errors of all test sets were not higher than 0.355, and the losses were not higher than 0.067. Compared to the prediction results of the well-established multiple linear regression model, the convolutional neural network model has higher prediction accuracy and robustness, with the prediction error falling between 2% and 7%. Thus, the network comprehensively reflects the objective metrics affecting