Classification of Inter-Floor Noise Type/Position via Convolutional Neural Network-Based Supervised Learning †

† This article is a rewritten and extended version of "Classification of noise between floors in a building using pre-trained deep convolutional neural networks", presented at a conference on acoustics.

Abstract: Inter-floor noise, i.e., noise transmitted from one floor to another through walls or ceilings in an apartment building or a multi-story office building, causes serious social problems in South Korea. Notably, inaccurate identification of the noise type and position by human hearing intensifies conflicts between residents of apartment buildings. In this study, we propose a robust approach using deep convolutional neural networks (CNNs) to learn and identify the type and position of inter-floor noise. Using a single mobile device, we collected nearly 2000 inter-floor noise events comprising 5 types of inter-floor noise generated at 9 different positions on three floors of a Seoul National University campus building. Based on pre-trained CNN models designed and evaluated separately for type and position classification, we achieved type and position classification accuracies of 99.5% and 95.3%, respectively, on validation datasets. In addition, the robustness of noise type classification was checked against a new test dataset, generated in the same building, that contains 2 types of inter-floor noise at 10 new positions. The approximate positions of the inter-floor noises in the new dataset with respect to the learned positions are presented.


Research Motivation
In apartment buildings, noise generated by residents or home appliances, called inter-floor noise, travels through structures and annoys residents on other floors [1][2][3][4]. In South Korea, many people are exposed to inter-floor noise: approximately 51.3% of residential buildings are multi-dwelling units, and 72.7% of these have at least five floors [5]; such buildings are classified as apartment buildings. Accordingly, civil complaints about inter-floor noise have become a serious social issue in Korea [6]. To mediate disputes between residents of apartment buildings, the Korean government established the Floor Noise Management Center in 2012; the center is operated by the Korea Environment Corporation under the Ministry of Environment. From 2012 to March 2018, the center received 119,500 complaints about inter-floor noise and visited 28.1% of the complainants to identify the noise [7].
It is difficult for humans to identify the type/position of inter-floor noise precisely, and some conflicts between residents have originated from incorrect estimation of the noise type/position. Correctly identifying the type/position of inter-floor noise with a personal device could therefore help prevent or reduce conflicts. Moreover, correctly identifying the noise type/position is considered the first step toward solving the noise problem.

Related Literature
Related research in the literature [8][9][10] on indoor footstep localization is in its early stages, and only a small number of reports are available [8]. These prior publications were motivated by indoor occupant localization for energy management, understanding occupant behavior, facility security, and other smart building applications. In this literature, a wave induced by a footstep in a building structure is considered to be a dispersive plate wave with signal distortion. Bahroun et al. [9] introduced the perceived propagation velocity, which reflects the dispersive nature of a wave propagating through a concrete slab. They showed experimentally that the perceived propagation velocity decreases as the wave propagates away from the source, which limits the performance of algorithms based on time difference of arrival (TDOA); to overcome this, they proposed a localization method that uses only the sign of the measured TDOA. Poston et al. [8] proposed a footstep localization method with enhanced TDOA, in which two footstep-to-sensor interaction types were identified. Mirshekari et al. [10] used a decomposition-based dispersion mitigation technique to overcome the dispersive nature of the wave and enhance localization performance. Footstep-induced vibrations were measured using accelerometers [8,9] or geophones [10] with multiple channels. Other related research is discussed in [11,12].

Research Approach
In this paper, we introduce an inter-floor noise type/position classification algorithm based on supervised learning. This method classifies a given inter-floor noise into a type and position. An intuitive explanation of the method is as follows: (1) inter-floor noises are generated and recorded with a single microphone, (2) the dataset is labeled and converted to time-frequency (TF) patches, (3) a convolutional neural network (CNN)-based model is trained on the TF patches, and (4) a new inter-floor noise is classified into a noise type/position with the trained model.
One unique feature of the method is that inter-floor noise measured with a single microphone is used to determine the position of the noise source, rather than multichannel signals from accelerometers or geophones. Instead of estimating the TDOA between multiple accelerometers or geophones, a signal of sufficient duration from a single microphone is converted to a TF-patch and used as the input. The TF-patch might encode the dispersive nature of the plate wave, e.g., a change in propagation velocity or other unidentified features. In this sense, the method can be considered to learn the response of the building, observed with a single sensor, to a given excitation [8].
Because this approach determines the type/position of a given inter-floor noise from a single-microphone signal via supervised learning, it has some inherent limitations. Although the sub-meter localization accuracy reported in the literature might not be necessary for inter-floor noise type/position classification, a single microphone might estimate the position of a noise source with lower accuracy than multiple densely installed sensors. The machine learning approach also requires collecting training data in every building structure where it is used.
Despite these limitations, our method has some clear advantages.
(1) A microphone embedded in a smartphone can be used for classification; if the method is packaged in a mobile application, many people can use it to identify noises. (2) In the literature [8][9][10], only footstep-induced vibration is considered, whereas our method can also identify the type of noise, as shown in Sections 3.4 and 4.1. (3) Managing and deploying a single sensor is easier than managing multiple sensors.

Contributions of this Paper
The primary contributions of this paper can be summarized as follows. First, a dataset of 2950 inter-floor noises, covering 5 noise types and 19 positions, was built as part of this project. Second, we propose a supervised learning-based inter-floor noise type/position classifier that uses a signal from a single microphone. Third, our method is an acoustic approach, whereas previous studies used accelerometers or geophones.
The remainder of the paper is organized as follows. The inter-floor noise dataset is explained in Section 2. Inter-floor noise type/position classification via supervised learning is described in Section 3. The approach is evaluated on a newly generated inter-floor noise dataset in Section 4, and the paper is summarized in Section 5.

Inter-Floor Noise Dataset
An inter-floor noise dataset (SNU-B36-50) was built in our previous study on inter-floor noise type/position classification [13]. The inter-floor noise dataset is available at [14].

Selecting Type and Position of Noise Source
The inter-floor noise dataset was designed based on a report provided by the Floor Noise Management Center [7]. As noted above, from 2012 to March 2018 the center received 119,500 complaints from residents suffering from inter-floor noise and visited 28.1% of the complainants to identify the noise. The identified noise types and their proportions are footsteps (71.0%), hammering (3.9%), furniture (3.3%), home appliances (vacuum cleaners, laundry machines, and televisions) (3.3%), doors (2.0%), and so on; unidentified or unrecorded noise types account for 10.1%. Of the identified inter-floor noises, 79.4% originated from residents on the upper floor, while 16.3% originated from residents on the lower floor [7].
Based on the report, the top 4 noise types, accounting for 92.9% of the identified noise types, were selected for classification. Because 95.7% of the inter-floor noises were identified as coming from the upper or lower floor, we focused on noises generated on the two adjacent floors. In addition, inter-floor noises on the middle floor were also collected to check whether our model can distinguish noises generated on the same floor from those generated on other floors. Figure 1 shows the 5 inter-floor noise types included in the dataset: a medicine ball falling to the floor from a height of 1.2 m (MB), a hammer dropped from 1.2 m above the floor (HD), hammering (HH), dragging a chair (CD), and running a vacuum cleaner (VC). Generating reproducible footstep noises is very challenging, and repeatedly generating them could injure the person producing them; thus an impact ball, a bang machine (a tire), or a tapping machine is usually used to produce impulsive footstep noise dominated by low-frequency components [15]. In our dataset, a 2 kg, 0.2 m diameter medicine ball, which generates noise with a frequency characteristic similar to that of a dropped impact ball, was used to mimic footsteps. Because it is difficult to transport and install a laundry machine or a television, only a vacuum cleaner was used to generate home appliance noise.

Generating and Collecting Inter-Floor Noise
The inter-floor noise dataset was collected in building 36 at Seoul National University. The building is a reinforced concrete frame structure; according to statistics [16,17], this is the most widely used structure type in modern South Korean buildings. The experimental building is partitioned with concrete walls, and the corridor where the noises were generated is a slab covered by concrete terrazzo tiles. Figure 2 illustrates the noise source and receiver arrangement for generating and recording inter-floor noise. The 9 solid circles indicate the selected noise source positions, separated by 6 m along the X axis. The notation beside each noise source indicates the floor and the distance along the X axis, e.g., 2F0m denotes a noise source on the second floor at 0 m from the origin along the X axis. A single microphone in a smartphone (Samsung Galaxy S6 [18]) was used as the receiver. The receiver was installed on 2 F to receive inter-floor noises from the upper and lower floors; the solid square in Figure 2 indicates its position, and a photo of the corridor and the receiver location is shown in Figure 3. Inter-floor noises were sampled at f_s = 44,100 Hz for approximately 5 s. Fifty inter-floor noises were generated at each position for each noise type. Because VC on 1 F and 3 F was barely audible at the receiver position, VC was generated only on 2 F. Table 1 shows the number of data in the inter-floor noise dataset. One distinguishing feature of the dataset is that each inter-floor noise is labeled with both a noise type and a noise source position; thus, the dataset can be used to learn the type and position of various noise sources in a building.

Supervised Learning of Inter-Floor Noises
In this section, we present a supervised learning method for inter-floor noise type/position classification using a CNN.

Convolutional Neural Networks for Acoustic Scene Classification
The superiority of CNNs for image classification has already been demonstrated in the ImageNet Large Scale Visual Recognition Competition (ILSVRC) [19]. Following this success, CNNs were also used for acoustic scene classification (ASC). CNN architectures used for ASC are fundamentally the same as those used in image classification: audio samples are converted to 2-dimensional features and fed to the input layer in place of images.
The uses of CNNs for ASC are mostly focused on environmental sound classification [20,21] and automatic species classification of animal life [22,23]. These two areas were also the main interests in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 challenge [24]. We extend these uses of CNNs to inter-floor noise type/position classification. In [25], state-of-the-art CNNs for image classification in ILSVRC are examined and applied to audio classification. The models employed there are AlexNet [26], VGGNet [27], and ResNet [28]. AlexNet was the first CNN-based winner of ILSVRC. VGGNet showed better image classification performance on the ImageNet dataset than AlexNet in ILSVRC 2014. ResNet won ILSVRC 2015 and showed better image classification performance than the other two models. VGGNet and ResNet each come in variants distinguished by the number of layers (depth) of the network; VGG16 and ResNet V1 50, considered the basic models of each family, were used in this study.

Network Architecture
The inter-floor noises are converted to log-scaled Mel-spectrograms using LibROSA [29] to represent audio samples in 2 dimensions with size H × W, where H and W denote the height and width, respectively. In the literature [20,21,25,[30][31][32], spectrograms, log-scaled Mel-spectrograms, and Mel-frequency cepstral coefficients are used as input features for ASC; they represent a signal in the TF domain and are acceptable CNN inputs. In our previous study [13], the log-scaled Mel-spectrogram provided the best performance on the inter-floor noise dataset, so it is used in this study.
A log-scaled Mel-spectrogram P is obtained through the following steps. First, a signal s ∈ R^l is extracted from an audio clip in the inter-floor noise dataset; the sample length l is set to 132,300 samples (3 s). Second, s is converted to a magnitude short-time Fourier transform S ∈ R^(1025×W) using a 2048-point fast Fourier transform with window size ⌈l/W⌉ and the same hop size, where ⌈·⌉ rounds up to the next largest integer. Third, a Mel-spectrogram is obtained as

M = FS,

where F ∈ R^(H×1025) is a Mel-filter bank covering the frequency range [0, f_s/2]; this converts the frequency scale of S to the Mel scale. Finally, M is converted to a log-scaled Mel-spectrogram

P = 10 log10(M/m),

where m is the largest element of M.

H × W is given by the input size W × H × 3 of a CNN; for example, VGG16 has an input size of 224 × 224 × 3. Since the three CNNs (AlexNet, VGG16, and ResNet V1 50) are designed for image classification, they have 3 input channels (BGR). To take advantage of the knowledge learned from a large dataset, the weights between the three input channels and the following layer are preserved instead of being modified for a single channel; consequently, P is supplied to all three channels. The potential of this method on the inter-floor noise dataset was shown in [13] via a performance comparison between a CNN with one input channel and VGG16 without knowledge transfer. The three CNNs contain millions of learnable weights, and given the small size of the inter-floor noise dataset, it is difficult to train models with this many weights from scratch. In such conditions, a CNN can be trained using transfer learning, as suggested in [33]: the CNN is initialized with weights pre-trained on a large dataset (the source) and fine-tuned on the target dataset. The method assumes that the internal layers of a CNN extract mid-level descriptors from the source that explain the distribution of the source.
The distribution of a target dataset can then be learned by sharing these mid-level descriptors [33,34]. In this study, a large image dataset is used as the source for training a CNN via transfer learning. Image and sound representations are usually considered different; however, low-level notions of images [34], such as edges and shapes, can be found in P, and changes in lighting are comparable to acoustic pressure changes in P. Hence, P can be considered an image of size H × W, and the descriptors of the source can be shared to learn the distribution of the target. This approach to ASC can be found in [35,36].
The weights of the three CNNs are initialized with weights pre-trained on ImageNet (the source). The pre-trained weights used in this study are from [37] for AlexNet and VGG16, and from [38] for ResNet V1 50. Because the three CNNs are designed for ILSVRC, each has an output of size I = 1000. The adaptation layer reduces the output dimension of the lower layers to C, where C is the number of inter-floor noise types/positions. The weights between the CNN output and the adaptation layer are randomly drawn from a normal distribution with a fixed standard deviation of 0.01; this initialization is used in [26,27,39]. The bias b ∈ R^C is initialized to 1, as in [26].
In a classification problem, the output values are normalized using a softmax function (i.e., a softmax classifier) to convert the output elements to pseudo-probabilities. For a given x ∈ R^n, the softmax function is defined as

softmax(x)_c = exp(x_c) / Σ_j exp(x_j).

Let z_c be the c-th output node of the adaptation layer,

z_c = Σ_i w_{i,c} σ_i + b_c,

where w_{i,c} is the weight between the i-th node σ_i of the former layer and the c-th node of the adaptation layer. The predicted probability of a given inter-floor noise for each of the C type/position categories is ŷ_c = softmax(z)_c. The loss function L selected for optimizing the three CNNs is the cross-entropy loss with L_2-regularization of w_{i,c},

L = −Σ_c y_c log ŷ_c + λ Σ_{i,c} w_{i,c}²,

where λ is the regularization strength and y ∈ R^C is the one-hot encoded true label.
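The adaptation layer, softmax classifier, and regularized loss can be sketched in NumPy. This is an illustrative sketch under the paper's stated definitions, not the authors' TensorFlow implementation; the initialization values follow the text (weights ~ N(0, 0.01), bias 1).

```python
import numpy as np


def softmax(x):
    """Softmax with the usual max-subtraction for numerical stability."""
    e = np.exp(x - x.max())
    return e / e.sum()


def adaptation_and_loss(sigma, W, b, y, lam):
    """Class scores, predicted probabilities, and regularized loss.
    sigma: I-dim CNN output; W: I x C adaptation weights; b: C biases;
    y: one-hot true label; lam: regularization strength lambda."""
    z = sigma @ W + b                          # z_c = sum_i w_{i,c} sigma_i + b_c
    y_hat = softmax(z)                         # y_hat_c = softmax(z)_c
    # Cross-entropy loss plus L2-regularization of the adaptation weights
    loss = -np.sum(y * np.log(y_hat)) + lam * np.sum(W ** 2)
    return y_hat, loss


# Initialization as described in the text: N(0, 0.01) weights, bias of 1
rng = np.random.default_rng(0)
I, C = 1000, 5                                 # ILSVRC output size, C = 5 noise types
W = rng.normal(0.0, 0.01, size=(I, C))
b = np.ones(C)
```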

Evaluation
The performance of the three CNNs was evaluated using 5-fold cross validation, which divides the inter-floor noise dataset into 5 subsets of equal size. A model is optimized on a zero-centered training set Q^(train) composed of 4 of the subsets, and validated on the remaining set Q^(valid), which is zero-centered on the mean value of the training set. These steps are repeated so that every subset serves once as the validation set.
The performance of a CNN model M on a dataset is measured by finding the minimum

ψ(Λ*) = min_Λ ψ^(valid)(Λ),

where Λ* is the optimal hyperparameter pair [40]; a model M with smaller ψ provides better performance. Λ* is composed of the optimal regularization strength λ* and the optimal learning rate η*. The sizes of the hidden layers, the number of hidden units, and the activation functions are not considered because they are already determined by M. Λ* was estimated via random search, as introduced in [40], with 30 epochs: candidates λ_i and η_i (i = 1, ..., 100) were generated log-uniformly from 10^−4 to 10^2. The weights of the CNNs were optimized via mini-batch gradient descent (GD) to minimize L, with a mini-batch size of 39. The CNNs and the optimization were implemented with TensorFlow [38] and are available at [41].
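The hyperparameter search and data-splitting protocol above can be sketched as follows. This is a schematic NumPy illustration of the described procedure (log-uniform candidate generation, 5 equal folds, zero-centering on the training mean), not the authors' released code; function names are our own.

```python
import numpy as np


def log_uniform_candidates(rng, low=1e-4, high=1e2, n=100):
    """Draw n hyperparameter candidates log-uniformly on [low, high],
    as in the random search of [40] (lambda_i, eta_i, i = 1..100)."""
    return 10.0 ** rng.uniform(np.log10(low), np.log10(high), n)


def five_fold_indices(n_samples, rng):
    """Split sample indices into 5 equal-size folds for cross validation."""
    idx = rng.permutation(n_samples)
    return np.array_split(idx, 5)


def zero_center(train, valid):
    """Zero-center the training set; the validation set is centered
    on the mean of the training set, as described in the paper."""
    mu = train.mean(axis=0)
    return train - mu, valid - mu
```

Each (λ_i, η_i) pair would then be scored by ψ^(valid) after 30 training epochs, and the pair with the smallest validation loss kept as Λ*.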
In the remaining subsections, inter-floor noises are classified into type/position using the three CNNs, and the performance of the three CNNs is measured and compared. Type and position are considered separately in Sections 3.4 and 3.5, respectively.

Type Classification Results
The inter-floor noises in the dataset were labeled with the following (C = 5) type categories: MB, HD, HH, CD, and VC. The weights of the three CNNs with estimated optimal hyperparameters Λ*_type were optimized to minimize L using GD for 30 epochs, which is sufficient to minimize ψ^(valid)(Λ_type). Table 2 shows the type classification accuracy of the three CNNs under 5-fold cross validation; the first column shows the names of the CNNs, and the first row shows the 5 categories. VGG16 outperforms ResNet V1 50 in all categories. The performance of VGG16 equals that of AlexNet in 3 categories (HD, HH, and VC) and is slightly better in the remaining categories. All types of inter-floor noise are classified correctly by VGG16 with less than 1% error.

Position Classification Results
The inter-floor noises in the dataset were labeled with the following (C = 9) position categories: 1F0m, 1F6m, 1F12m, 2F0m, 2F6m, 2F12m, 3F0m, 3F6m, and 3F12m. The weights of the three CNNs with estimated optimal hyperparameters Λ*_position were optimized to minimize L using GD for 50 epochs, which is sufficient to minimize ψ^(valid)(Λ_position).
The accuracy of position classification with the three CNNs was evaluated using 5-fold cross validation; the accuracies are arranged in Table 3. The first column of the table shows the names of the three CNNs, and the first row shows the 9 categories. In position classification, VGG16 outperforms the other models, except for categories 2F and 3F0m. All models show comparatively poor performance at positions 1F0m and 1F6m; the models seem to confuse these two positions, as can be seen in the confusion matrix in Figure 5. If confusions between positions on the same floor are ignored, the classification accuracy of the three CNNs increases; Table 4 shows the resulting floor classification accuracy, with VGG16 achieving 99.5%. Summarizing the results, the three adapted CNNs with knowledge transfer were compared for type/position classification; they demonstrated the feasibility of classifying the type/position of inter-floor noises in the building, with VGG16 performing best on both tasks.

Type/Position Classification of Inter-Floor Noises Generated on Unlearned Positions
Inter-floor noises generated from 10 new positions in the same building are evaluated in this section. Their types/positions are classified using VGG16, which was the best performing model in Sections 3.4 and 3.5. Through these steps, we address two questions about the robustness of the model: (1) Can the model correctly classify the type of new inter-floor noise data? (2) If an inter-floor noise is generated at an unlearned position, does the model classify it to a learned position near the true one?
These questions were investigated with a dataset of 1000 newly generated inter-floor noises. Figure 6 illustrates the noise source and receiver arrangement for generating and recording the new inter-floor noises. The 10 circles in the grid pattern show the selected positions: noise sources spaced 1 m apart along the X axis on 3 F, excluding the learned positions 3F0m, 3F6m, and 3F12m. MB and HH were selected, and the noise sources were densely positioned on 3 F, because MB and HH correspond to the top 2 identified inter-floor noise types and 79.4% of identified noise sources were located on the upper floor [7], which corresponds to 3 F in our setup. Table 5 shows the number of data in the new inter-floor noise dataset. This dataset is available at [42].

Type Classification of Inter-Floor Noises Generated from Unlearned Positions
To address the first question in Section 4, M_{Λ*_type},VGG16, the VGG16 optimized on P ∈ Q^(train) with the noise type labels in Section 3.4, was evaluated against the new dataset. Because type classification was evaluated using 5-fold cross validation in Section 3.4, there is one trained model per fold. Type classification accuracy was measured as follows: (1) the new inter-floor noise dataset with noise type labels was converted to P ∈ Q^(test); (2) P ∈ Q^(test), zero-centered on the mean value of P ∈ Q^(train), was classified into C = 5 type categories using M_{Λ*_type},VGG16 for all fold models; (3) the average type classification accuracy was calculated. Table 6 shows the confusion matrix of the type classification results, which visualizes how data with given true labels are predicted. Comparing this confusion matrix with Table 2 addresses the robustness of M_{Λ*_type},VGG16 to the limited position changes on 3 F.
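The test protocol above (zero-center on the training mean, classify with each fold's model, average the results) can be sketched as follows. This is a schematic illustration with hypothetical names; `models` stands in for the per-fold classifiers, assumed here to be callables returning class scores.

```python
import numpy as np


def averaged_confusion(models, X_test, y_test, train_mean, n_classes):
    """Evaluate per-fold models on a test set zero-centered on the
    training mean, and average the resulting confusion matrices."""
    X = X_test - train_mean                    # zero-center on Q^(train) mean
    cm = np.zeros((n_classes, n_classes))
    for model in models:                       # one trained model per fold
        pred = np.argmax(model(X), axis=1)
        for t, p in zip(y_test, pred):
            cm[t, p] += 1
    return cm / len(models)                    # average over the fold models
```

The averaged matrix shows how concentrated the predictions are on each true category, which is how the paper compares the new-position results with Table 2.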

Position Classification of Inter-Floor Noises Generated from Unlearned Positions
To address the second question in Section 4, M_{Λ*_position},VGG16, the VGG16 optimized on P ∈ Q^(train) with the position labels in Section 3.5, was evaluated on the new dataset. Because position classification was evaluated using 5-fold cross validation in Section 3.5, there is one trained model per fold. Q^(test) and Q^(train) are mutually exclusive in the position domain, so position classification accuracy cannot be quantified; however, the classification of P ∈ Q^(test) into the 9 learned positions can be visualized. Position classification was evaluated as follows: (1) the new inter-floor noise dataset with position labels was converted to P ∈ Q^(test); (2) P ∈ Q^(test), zero-centered on the mean value of P ∈ Q^(train), was classified into C = 9 position categories using M_{Λ*_position},VGG16 for all fold models; (3) the average position classification results were calculated and drawn in a confusion matrix, the averaging serving to check the concentration of the classification results on the three learned positions on 3 F. Figure 7 shows the confusion matrix of the position classification results. The vertical axis shows the position labels of P ∈ Q^(test), and the horizontal axis shows the position labels of P ∈ Q^(train); the value in each cell is the average of the corresponding classification results. Although the position classification accuracy cannot be quantified, floor classification results can be quantified from the confusion matrix.

Summary and Future Study
A CNN-based supervised learning method for classifying inter-floor noise types and positions is proposed in this paper. An inter-floor noise dataset, built based on a report by the Floor Noise Management Center, was used to evaluate the proposed method. State-of-the-art CNNs for image classification (AlexNet, VGG16, and ResNet V1 50) were adapted for noise classification. The inter-floor noises were converted to log-scaled Mel-spectrograms and used as the input to the CNNs. The weights of each CNN were initialized with weights pre-trained on ImageNet, the hyperparameters were optimized using random search, and each CNN with its optimal hyperparameters was evaluated using 5-fold cross validation. VGG16 showed the best type classification performance, with 99.5% accuracy, and the best position classification performance, with 95.3% accuracy over the 9 positions in the dataset.
To evaluate the robustness of the VGG16-based inter-floor noise type/position classification, the trained models were tested on a newly gathered dataset composed of inter-floor noises generated at unlearned positions near the learned positions. The VGG16-based model showed robust type classification under position changes, with 98.5% accuracy, and the new noises were classified into learned positions near their actual positions.
In summary, we presented a feasibility study of a convolutional neural network-based classifier for inter-floor noise type and position in a single building. Future study should focus on the generalizability of these results through evaluation on new inter-floor noise datasets with more diverse noise sources; for example, HH could be generated with a different hammer, and inter-floor noises could be gathered in other buildings to evaluate the generality of type/position classification. Such extensions could also contribute to the evaluation of other single-sensor approaches.

Conflicts of Interest:
The authors declare no conflict of interest.