Source Type Classification and Localization of Inter-Floor Noise with a Single Sensor and Knowledge Transfer between Reinforced Concrete Buildings

A convolutional neural network (CNN)-based inter-floor noise source type classifier and locator with input from a single microphone was proposed in [Appl. Sci. 9, 3735 (2019)] and validated in a campus building experiment. In this study, the following extensions are presented: (1) data collections of nearly 4700 inter-floor noise events that contain the same noise types as those in the previous work at source positions on the floors above/below in two actual apartment buildings with spatial diversity, (2) the CNN-based method for source type classification and localization of inter-floor noise samples in apartment buildings, (3) the limitations of the method as verified through several tasks considering actual application scenarios, and (4) source type and localization knowledge transfer between the two apartment buildings. These results reveal the generalizability of the CNN-based method to inter-floor noise classification and the feasibility of classification knowledge transfer between residential buildings. The use of a short and early part of event signal is shown as an important factor for localization knowledge transfer.


Motivation
In multi-dwelling units, noises generated by occupants propagate through the structures and exert an unpleasant effect on neighboring occupants [1][2][3][4], which is a serious problem in major cities in Korea, where most residential buildings are multi-dwelling units [5]. For example, 62% of the residential buildings of South Korea are classified as apartment buildings of more than five storeys [6]. Accordingly, the Floor Noise Neighborhood Center of the Korea Environment Corporation [7], affiliated with the Ministry of Environment, received 152,061 complaints about inter-floor noise from 2012 to 2019 [8].
Identifying inter-floor noise that travels through multi-storey residential buildings is challenging because human hearing often fails to determine its source. Incorrect identification of inter-floor noise by human hearing often causes conflicts among occupants. In one case, the victim complains to an occupant who did not generate any inter-floor noise. In another case, the person who made the noise pretends not to know about it and ignores the victim's complaints. In both cases, technical identification of inter-floor noise can provide a proper basis for settling the dispute. Human ears process and discriminate sound accurately, but identification of inter-floor noise by machine could provide less biased results. In the authors' previous studies [9,10], a single microphone-based inter-floor noise source type classifier/locator was proposed to assist with the identification problem using convolutional neural network (CNN)-based supervised learning, rather than approaches with multiple channels of accelerometers [11,12] or geophone [13] networks. The method was verified on an actual dataset obtained from a campus building. This method can be implemented on a personal mobile phone and constructs a data-driven model that helps reduce failures of noise identification by human hearing, with less human bias, and provides a proper basis for settlement when the offender disregards the complaints. However, validation of the generalizability of the method in actual residential buildings was left for future study [10]. Therefore, the method needs to be verified in many scenarios, and its limitations need to be determined.

Related Literature
Sound classification, which deals with tasks that are similar to the source type classification in this study, has been studied in acoustic scene classification (ASC) fields [14,15]. The conventional data-driven methods for ASC extract features from audio waveforms adopting Mel-filterbanks, Mel-frequency cepstral coefficient (MFCC), or principal component analysis (PCA) and classify the extracted features into a category via majority voting or support vector machine (SVM) [16][17][18]. Recently proposed methods follow the deep neural network (DNN)-based scheme after adoption of CNN [19].
In the conventional acoustic source localization, a method based on the triangulation technique (Tobias algorithm) obtained analytical solutions using multiple channels of sensors on known environmental properties, such as positions of sensors and sound speed [20]. The triangulation technique typically used the time of arrival (TOA) and the group velocity of the direct wave [21]. A pair of TOAs, the absolute time instants when a transmitted signal is detected by each sensor, provides three equations of the circle in two dimensions. Solution for source position exists in the form of the intersection of the three circles. Complementary triangulation techniques were introduced to minimize the effect of the dispersive guided waves. They were realized by optimizing the error function [22] or combining the continuous wavelet transform, Newton's method, and the line search algorithm [23]. A model-based impact locator with a single sensor was proposed and validated in a plate structure [24].
Learning-based methods are possible alternatives to the model-based methods. These approaches learn the relationship between the given source positions and signals traveled through a structure. Grabec et al. [25] adopted the adapted learning and series of reference signals with position information to localize sources. Kosel et al. [26] adopted the responses of discrete sources to train the locator. As an application to the human-device interface, Ing et al. [27] introduced the time-reversal process for localizing the impact of a finger on a plate. Ciampa et al. [21] demonstrated the feasibility of adopting the time-reversal process for localization in a composite structure. Ruiz et al. [28] proposed an impact localization method based on projections to latent structures. Although these methods require data gathering, they do not need knowledge on sound speed of the medium or receiver position.
Neural network (NN) is an important tool for approximating the complex relationship between response signals and their source types/positions. In [29], the localization of impact and damage detection was demonstrated simultaneously using a multilayer perceptron. In [30], a CNN-based approach was introduced to localize acoustic sources in a plate with rivet-connected stiffeners. Notably, a few previous studies [21,27,30] adopted a single sensor and demonstrated the feasibility of single sensor-based localization. However, these approaches were analyzed in plate-like or simplified composite plates, which are considered to be simpler structures than those in real life.
For practical application, it is important to elucidate the feasibility of the algorithm using actual datasets. Previous studies on single sensor-based localization in real-world applications have been conducted. A previous study [31] demonstrated localization in a room via echo labeling. Two other studies [32,33] synthesized training data via model-based simulations to train their CNNs. Both presented estimates of the range between a source and a single hydrophone on actual sea trial data. In addition, the former [32] verified the practicality of depth estimation, whereas the latter [33] focused on classifying the ocean bottom type.
Localization in actual buildings has been studied in indoor occupant localization fields. Considering the dispersive nature of waves in a plate and the sign of the measured time differences of arrival (SO-TDOA), Bahroun et al. [11] introduced a footstep localization technique in a damped and dispersive medium. Poston et al. [12] proposed a footstep-type-aware footstep localization technique, which identifies a given footstep type (compression or non-compression) and applies a type-wise localization algorithm. In addition, a tracking algorithm for moving occupants on linear trajectories was studied [34]. Mirshekari et al. [13] demonstrated the limitation of signal distortion to enhance the TDOA estimation using wavelet transform for the localization of footstep-induced vibrations. Woolard [35] studied learning-based event localization in a hallway and on stairs in a campus building using a nearest neighbor algorithm with three accelerometers. In addition, the feasibility of a single sensor-based direction of arrival estimation on a beam was demonstrated via simulation.

Approach
This study presents the CNN-based source type classifier and locator with a single microphone on inter-floor noise data obtained from two actual apartment buildings to verify the generalizability of the method, which is considered important for a data-driven approach. In addition, the feasibility of learned source type and localization knowledge transfer between similar reinforced concrete buildings is presented. This approach relies significantly on deep learning to formulate the source type classifier and locator for air-concrete-steel mixed environments, where building properties are insufficiently known, as well as for structures with high structural complexity in the acoustic medium. Similar to the learning-based method proposed in the previous studies [9,10], it learns responses with source type/position labels transmitted from discrete positions in the buildings to formulate the data-driven identification of inter-floor noise in reinforced concrete buildings using a single microphone, thereby extending the application of deep learning. Accordingly, inter-floor noise was obtained from two actual apartment buildings. The data points selected were on slabs of the rooms allowed for experiments, whereas the campus building dataset (SNU-B36-50E [10]) in the previous study includes only those on corridor slabs in two-dimensional spaces. Several inter-floor noise identification and knowledge transfer tasks were conducted on the new dataset to demonstrate the generalizability and to elucidate the limitations and uncertainties of the method.

Contributions
The contributions of this study are summarized as follows. (1) Inter-floor noise datasets were built with noise samples obtained from two actual reinforced concrete apartment buildings to study data-driven source type classification and localization in actual reinforced concrete buildings. (2) The CNN-based source type classification and localization with a single microphone was demonstrated via several tasks on the datasets. In addition, the limitations of this approach were discussed. (3) The feasibility of the learned source type and localization knowledge transfer between the apartment buildings was demonstrated. Provided that the source type and localization knowledge of trained samples can be reused for tasks without additional training or even under data sparsity, the noise identification method can be used widely. (4) It was empirically shown that using a short and early part of an inter-floor noise signal is effective for localization knowledge transfer between the buildings.
The remainder of the paper is organized as follows. The apartment building inter-floor noise datasets are explained in Section 2. An onset detection is described in Section 3 that finds the event start position of an inter-floor noise signal to reduce human effort to achieve visual annotation of the event. Several tasks for verifying the source type classifier, locator, and knowledge transfer between two apartment buildings are prepared. The measured performance of the approach is reported and discussed in Section 4. Finally, the paper is summarized in Section 5.

Apartment Building Inter-Floor Noise Datasets
The two datasets adopted in this study contain inter-floor noise recorded using a single microphone in two actual apartment buildings. They are designed to study the CNN-based source type classification and localization in apartment buildings. These extend the dataset obtained from the campus building in the previous study [10]. The data points were selected to simulate inter-floor noises on the floor above/below based on the noise statistics [4], which provide the main unpleasant noise types and source positions reported by occupants. The key purposes of the datasets are to verify (1) the generalizability of CNN for source type classification and localization of inter-floor noise in actual reinforced concrete buildings, not only in the campus building as exhibited in the previous work [10]; (2) source type classification and localization of inter-floor noise transmitted through unlearned floor sections and from unlearned positions, which can be seen as knowledge transfer within a single building; and (3) source type and localization knowledge transfer between two similar reinforced concrete buildings.
The selection of source types and positions of inter-floor noise for data construction was discussed in detail in the previous study [10] based on the noise statistics [4]. The statistics provide the source types and positions of the identified inter-floor noises from the analysis of the 119,500 complaints investigated by the center from 2012 to March of 2018. The identified source types and their contributions to inter-floor noise complaints are footsteps (71.0%), hammering (3.9%), furniture (3.3%), home appliances (vacuum cleaner, laundry machines, and television) (3.3%), doors (2.0%), and unidentified or unrecorded sources (10.1%). Of the identified inter-floor noises, 79.4% were from the floor above and 16.3% were from the floor below. In other words, 95.7% of the complaints originated from inter-floor noises on the floors above/below.
Inter-floor noises were generated in two apartment buildings with reference to this discussion. The inter-floor noise obtained from the two apartment buildings can be classified into five source types, as shown in Figure 1. They are the same source types as those included in the dataset from the campus building: a medicine ball falling to the floor from a height of 1.2 m (MB), a hammer dropped from a height of 1.2 m above the floor (HD), hammering (HH), dragging a chair (CD), and operating a vacuum cleaner (VC). The inter-floor noise generating procedures are the same as those in the previous work [10]. Such building structures are reported as the most widely used types for modern buildings in South Korea [36,37]. The slabs of the apartment buildings were covered with vinyl flooring. The construction types of APT I and APT II are a reinforced concrete wall structure and a reinforced concrete masonry structure [38], respectively. The reinforced concrete wall structure bears the load generated by its own weight through its walls and has been the mainstream of modern apartment building construction in South Korea. According to statistics provided by the Korean Ministry of Land, Infrastructure and Transport, 98.5% of new residential multi-dwelling units during 2007-2017 were constructed using this method [39]. The reinforced concrete masonry structure was usually adopted for the construction of low-rise buildings during the 1960s-1980s in South Korea.
Both the APT I and APT II datasets are designed to obtain the five source types of inter-floor noise from the floors above/below, similar to the campus building dataset. Inter-floor noise was recorded using a smartphone [40] microphone with a sampling rate $f_s$ of 44,100 Hz. The duration of each recording is approximately 5 s, and each recording contains a single event. The height of the receiver was 1.5 m above the floor, as set for the campus building dataset. Each dataset is split into a training/validation dataset and a test dataset. The training/validation dataset from APT I was prepared as follows. The five source types were generated at 1-A and 1-B, and sampled with the receiver on the floor above/below as illustrated in Figure 2b. The data points 1-A and 1-B are the centers of the two spaces evenly dividing the room allowed for the experiment. Their positions relative to the receiver are labeled with a suffix (e.g., 1-A-a), where a and b represent the floors above and below relative to the receivers, respectively. VC from the floor below, i.e., VC from 3F to 4F, was not recorded, as this source type was barely audible from the floor above (4F). For the test dataset from APT I, inter-floor noise was generated at 1-A′, 1-B′, 1-C, 1-D, and 1-E, as illustrated in Figure 2c.
The noise was sampled with the receiver on the floor below (3F). 1-A′ and 1-B′ are at the same XY positions as those of 1-A and 1-B, respectively. Therefore, 1-A′-a and 1-B′-a can be considered the same as 1-A-a and 1-B-a from the viewpoint of the receivers. The five source types were generated at these two positions (1-A′ and 1-B′). In addition, MB and HH were generated at 1-C, 1-D, and 1-E, because these source types account for a large portion of the identified source types in the complaint analysis [4]. Their positions relative to the receiver position are labeled in the same manner. The two separately labeled data domains, $D_{\text{APT 1}} = \{\mathcal{X}, \mathcal{Y}_{\text{APT 1}}\}$ and $D_{\text{APT 1}'} = \{\mathcal{X}', \mathcal{Y}_{\text{APT 1}'}\}$, are combined and represented as $D_{\text{APT I}} = \{\mathcal{X}, \mathcal{Y}_{\text{APT I}}\}$. Most multi-storey residential buildings have almost the same structure on all floors. However, the arrangement of goods, such as furniture, can differ on each floor and act as a source of uncertainty. The test dataset in $D_{\text{APT 1}'} = \{\mathcal{X}', \mathcal{Y}_{\text{APT 1}'}\}$ can be adopted to verify the robustness of a source type classifier or locator against these scenarios.
The source types, except VC, were generated at all noise source positions in APT II. The training/validation dataset (APT 2 dataset) from APT II was prepared as follows. The four source types were generated at 2-A, 2-B, 2-C, and 2-D and sampled with the receivers on the floor above/below as illustrated in Figure 3b. Their positions relative to the receiver position are labeled in the same manner as in APT I. Each data point is at the center of a room or space, e.g., 2-A and 2-C are at the centers of the living room and bedroom, respectively. For the test dataset (APT 2′ dataset) from APT II, inter-floor noise was generated at 2-A′, 2-B′, 2-C′, and 2-D′, as illustrated in Figure 3c. They are at the same XY positions as those for the training/validation dataset, and their positions relative to the receiver position are labeled in the same manner. The APT I dataset contains noise samples generated only on the living room slabs. Approximately 50 inter-floor noise events were obtained for each source type at each relative position in APT I. In APT II, inter-floor noise was generated on the bedroom slabs as well as on the living room slabs. Approximately 60 inter-floor noise events were obtained for each source type at each relative position. The APT I and APT II datasets were each obtained in one day in the respective building. The total numbers of inter-floor noise events in the APT I and APT II datasets are 1785 and 2880, respectively. The data points in APT II may have more generalized conditions than those in APT I, because they are distributed over wider three-dimensional spaces.

Onset Detection
In the previous studies [9,10], a visually annotated inter-floor noise sample with a duration of 3 s was converted to a log-scaled Mel-spectrogram and classified into a source type and position category by a CNN-based classifier. However, visual annotation by humans demands considerable effort and expertise when large amounts of data must be annotated.
In the seismic signal-processing field, automatic seismic event-detection algorithms have been developed to replace laborious visual detection by humans. Allen [41] developed an earthquake timing-detection algorithm using time averaging and zero-crossing rate measurement of signals over seismometers. Allen's algorithm was set as a baseline in many other onset-picking studies. A modified onset-picking method was developed in the acoustic emission field using the Akaike Information Criterion (AIC) [42], referring to Allen's algorithm and the extended work [43]. Applications of this approach in the acoustic emission field can be found [12,44]. A broad-band maximum-likelihood method to estimate parameters and detect seismic events was studied [45]. Higher-order statistics (HOS)-based onset-picking methods were compared with Allen's algorithm and analyst's picks [46,47]. These HOS-based methods were simple in implementation, showed similar results to analysts' picks, and demonstrated less detection failure under high noise levels than that of Allen's algorithm.
Most of the signals in the inter-floor noise datasets are impact signals whose durations are short and which show a drastic energy rise at the event start positions. In this study, an onset-detection method using kurtosis is employed to detect the onset of the inter-floor noise signals. The kurtosis measures the heaviness of the tails of a distribution, which represents its non-Gaussianity. This method was selected for the following reasons: (1) using a single HOS property, kurtosis, is compact in implementation; (2) there exist very small differences between the onset detection results using skewness and kurtosis [46]; (3) the signals adopted in this study are obtained with a single sensor, so that exact onset-time picking is not required, unlike methods employing multiple channels of sensors to improve the performance of calculated properties, e.g., TDOA; and (4) a CNN is shift-invariant along the time axis of its input [48]. Equation (5) estimates the kurtosis of the M-sample sliding window, which returns a maximum value of $\hat{\gamma}_4(k)$, $(k = M, M + 1, \cdots, L)$, at an onset position for a given L-sample signal vector s(m) [47].
A confidence interval is usually considered to bound and ensure the probability of the signal's existence [47], which may be required in actual applications to reduce false alarms. However, this was not considered in this study, because the audio clips in the datasets each contain a single event such that the position with the maximum $\hat{\gamma}_4(k)$ can be assumed to be an onset. The window size M is usually selected to achieve the minimum difference between the analyst's pick and an estimated onset position. M = 3000 is sufficient to find an onset position.
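The sliding-window kurtosis picker described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the kurtosis is taken as the ratio of the fourth central moment to the squared variance within each window, and the function names `sliding_kurtosis`, `detect_onset`, and `synthetic_event` are hypothetical. The synthetic signal (low-level Gaussian noise followed by a decaying burst) only stands in for a recorded impact event.

```python
import math
import random


def sliding_kurtosis(signal, window):
    """Kurtosis of every `window`-sample block of `signal` (assumed estimator)."""
    values = []
    for end in range(window, len(signal) + 1):
        block = signal[end - window:end]
        mean = sum(block) / window
        m2 = sum((x - mean) ** 2 for x in block) / window  # variance
        m4 = sum((x - mean) ** 4 for x in block) / window  # 4th central moment
        values.append(m4 / (m2 * m2) if m2 > 0 else 0.0)
    return values


def detect_onset(signal, window):
    """Index of the last sample of the window with maximum kurtosis."""
    kurt = sliding_kurtosis(signal, window)
    best = max(range(len(kurt)), key=kurt.__getitem__)
    return best + window - 1


def synthetic_event(length=600, onset=300, seed=0):
    """Hypothetical test signal: weak noise plus a decaying burst at `onset`."""
    rng = random.Random(seed)
    signal = [rng.gauss(0.0, 0.01) for _ in range(length)]
    for m in range(onset, length):
        d = m - onset
        signal[m] += math.exp(-d / 20.0) * math.cos(0.9 * d)
    return signal


if __name__ == "__main__":
    estimate = detect_onset(synthetic_event(), window=100)
    print("estimated onset sample:", estimate)
```

The kurtosis peaks when the first impact sample enters an otherwise quiet window, which is why the end of the maximizing window is taken as the onset. This direct form costs on the order of L·M operations, so a practical version with M = 3000 would use running sums or a vectorized library.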

Convolutional Neural Network-Based Classifier
The CNN-based classifiers designed for image recognition already demonstrated source type classification and localization in the previous studies [9,10]. These prior studies extended the application of CNN to source type classification and localization of inter-floor noise. Several state-of-the-art CNNs [49][50][51], which have been widely employed in many other applications, were tested against inter-floor noise classification tasks using a dataset obtained from the campus building. The inter-floor noise classification tasks to be stated and solved in this study proceeded upon the assumption that CNNs designed for image recognition can be adopted for these tasks.
In this study, VGG16 [51] is employed as a CNN-based feature extractor from an inter-floor noise in the form of a log-scaled Mel-spectrogram P. This architecture showed the best performance among several state-of-the-art CNNs for inter-floor noise classification in the previous study [10]. An alternative to this approach is a one-dimensional CNN, which does not require conversion of the inter-floor noises to image-like features. The kernels of a one-dimensional CNN eventually learn a set of Mel-filters, which can be considered as a set of basis functions in different frequencies [52][53][54]. An audio signal filtered by the bottom convolutional layers of a one-dimensional CNN can therefore be compared with P; in this sense, P is almost equivalent to the output of the bottom convolutional layers of a one-dimensional CNN. Figure 4 illustrates the flows of the inter-floor noise classification using the two-dimensional CNN (Figure 4a) and one-dimensional CNN (Figure 4b). Two-dimensional CNN-based classification consists of (1) conversion of a signal to a log-scaled Mel-spectrogram P, (2) the convolutional layers and fully connected layers composing VGG16 as a feature extractor, (3) an adaptation layer (fc), and (4) classification using a softmax function. These were implemented with TensorFlow [55]. An inter-floor noise signal is converted to a log-scaled Mel-spectrogram $P \in \mathbb{R}^{H \times W}$ through the following steps, where the height H and width W are both defined as 224 by the input size of VGG16. This conversion was implemented with librosa [56]. A signal $s \in \mathbb{R}^{l_s}$ containing an inter-floor noise event is extracted from an audio clip in the dataset, where $l_s$ represents $f_s$ times the signal length t. The event start position in s is detected using the onset detection described in Section 3. Hz for the given $f_s$.
The windowed short time sample $x_w$, $(w = 0, \cdots, W - 1)$, is converted to a spectral power using a discrete Fourier transform. The start position of the next short time sample is determined by the hop size, h = {30, 99, 197, 296, 394, 591} samples, to achieve W of P for the given t. A block of the windowed short time samples $x = [x_0 | \cdots | x_{W-1}] \in \mathbb{R}^{N \times W}$ is converted to a power spectrogram X by Equation (6). Then, it is converted to a Mel-spectrogram M using a Mel-filterbank F, which changes the scale of frequency to the Mel-scale. The maximum frequency of the filterbank is set as 5 kHz, because most of the signals in the dataset exist below this frequency, and other inter-floor noise studies deal with frequency ranges below 5 kHz [57,58]. The entries of M are then rescaled to obtain the log-scaled Mel-spectrogram P. VGG16, which provides the two-dimensional convolutional layers and fully connected layers employed in this work, was originally designed for image-recognition tasks. It has three input channels for the reception of a batch of color images, followed by 13 convolutional layers and three fully connected layers. Because a single two-dimensional feature is obtained via conversion of an inter-floor noise event to P, the feature is given to all channels to train the weights Θ of the CNN and to test a trained CNN. As described in Section 2, the datasets obtained in this study are sparse compared to those adopted in other data-driven approaches. The Θ was initialized with weights pre-trained on a large-scale dataset (ImageNet) [59] to mitigate the problems originating from data sparsity. This can be viewed as transfer learning [60], which has already demonstrated its effectiveness in the image domain [61], as well as for inter-floor noise classification [9,10]. Images and sound recordings of inter-floor noise are different kinds of data. However, the inter-floor noise events are converted to image-like representations P before being used.
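The hop sizes listed above can be related to the signal lengths t used in this study by a short calculation. The sketch below assumes, as an inference from the listed values rather than a rule stated in the text, that the hop size is the smallest integer for which W = 224 frames cover the $l_s = f_s t$ samples; the function name `hop_size` is hypothetical.

```python
import math

SAMPLE_RATE_HZ = 44_100   # f_s, from the dataset description
SPECTROGRAM_WIDTH = 224   # W, fixed by the VGG16 input size


def hop_size(duration_s, sample_rate=SAMPLE_RATE_HZ, width=SPECTROGRAM_WIDTH):
    """Smallest hop (in samples) such that `width` frames span the signal.

    Assumed rule: h = ceil(l_s / W) with l_s = f_s * t.
    """
    n_samples = round(duration_s * sample_rate)
    return math.ceil(n_samples / width)


if __name__ == "__main__":
    durations = [0.152, 0.501, 1.00, 1.50, 2.00, 3.00]
    print([hop_size(t) for t in durations])
    # -> [30, 99, 197, 296, 394, 591]
```

Under this assumption the six signal lengths reproduce the six hop sizes in the text, which suggests that only the hop size, and not the spectrogram width, changes with t.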
Although the datasets are from different domains, low-level notions such as edges and shapes can be shared to learn the distribution of a new task [48]. ImageNet is a large-scale dataset that is adopted in transfer learning studies. Knowledge obtained from a large dataset has a higher chance of containing sharable low-level notions than knowledge learned from scratch on a sparse dataset. In addition, it can prevent over-fitting and contribute to generalization. The use of low-level notions from a totally different domain and the resulting performance improvements were presented in [61,62]. The use of a large-scale dataset is also studied in low-shot learning [63], which learns a metric using a large dataset and tests it against unseen data. The output size of VGG16 is already given as 1000 by the source task, which is classification of ImageNet. It needs to be reduced to the size of the label space of a target, $\mathrm{Dim}(\mathcal{Y}_T)$. This was realized by adopting an additional fully connected layer, called the adaptation layer, with a reduced number of nodes $n = \mathrm{Dim}(\mathcal{Y}_T)$. The weights between the last layer of VGG16 and the adaptation layer, $\theta \in \mathbb{R}^{1000 \times n}$, were initialized as random numbers following a normal distribution with a standard deviation of 0.01 [50,51,64]. The bias of the adaptation layer is initialized as 1 [50]. The pseudo-probability of n categories $y_i \in \mathcal{Y}_T$, $(i = 1, \cdots, n)$, for a given input P is represented as $\hat{y} \in \mathbb{R}^n$. It is obtained by inputting the output from the adaptation layer, $o \in \mathbb{R}^n$, into a softmax function (Equation (9)), $\hat{y}_i = \exp(o_i) / \sum_{j=1}^{n} \exp(o_j)$. The classification into a category (Equation (10)) selects the category $y_i$ with the largest $\hat{y}_i$. A one-dimensional CNN computes one-dimensional convolution directly on the raw waveform and does not require feature engineering, such as converting audio signals to image-like features. SoundNet [65], as illustrated in Figure 4b, is adapted as a feature extractor by adding two adaptation layers on top of the second-from-top convolutional layer (conv7), replacing the top convolutional layer (conv8).
Because SoundNet is fully convolutional and summarizes raw waveforms via one-dimensional convolution and max-pooling, waveforms are reduced to single values after passing through a few convolutional layers. Moreover, waveforms with different time lengths are summarized as outputs of different lengths by the convolutional layers. Consequently, the number of weights of the adaptation layers depends on the input size, and this hinders the accurate identification of the effect of t. Therefore, s with t = {0.152, 0.501, 1.00, 1.50, 2.00, 3.00} s was zero-padded to obtain an equal length of t = 3.00 s, and its amplitude was rescaled to the range [−1, 1]. The padded signal is filtered by the convolutional layers and yields an output of size 5120. The output from the convolutional layers is summarized by 5120 × 1024 (fc1) and 1024 × n (fc2) weights. The Θ in SoundNet is initialized with weights pre-trained on a large-scale dataset (two million videos [65]) for finding a relationship between audio inputs and their corresponding image objects. The classification of inter-floor noise into source type or position categories is carried out using Equations (9) and (10).
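The final classification step shared by both networks, and the input preparation for the one-dimensional CNN, can be illustrated with a small sketch. It assumes the standard softmax followed by selection of the most probable category, and it assumes peak normalization as the rescaling to [−1, 1]; the text does not specify the rescaling rule. The function names and the example adaptation-layer outputs are hypothetical, while the five source type labels are those of the datasets.

```python
import math

SOURCE_TYPES = ["MB", "HD", "HH", "CD", "VC"]  # labels from the datasets


def softmax(outputs):
    """Pseudo-probabilities from adaptation-layer outputs o (stable form)."""
    shift = max(outputs)
    exps = [math.exp(v - shift) for v in outputs]
    total = sum(exps)
    return [e / total for e in exps]


def classify(outputs, labels):
    """Return the label with the largest pseudo-probability, plus all of them."""
    probs = softmax(outputs)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


def pad_and_rescale(signal, target_len):
    """Peak-normalize to [-1, 1] (assumed rule) and zero-pad to `target_len`."""
    peak = max((abs(v) for v in signal), default=0.0) or 1.0
    scaled = [v / peak for v in signal]
    return scaled + [0.0] * (target_len - len(scaled))


if __name__ == "__main__":
    example_outputs = [0.1, 1.2, 3.0, -0.5, 0.0]  # hypothetical values of o
    label, probs = classify(example_outputs, SOURCE_TYPES)
    print(label, [round(p, 3) for p in probs])
    print(pad_and_rescale([0.5, -2.0, 1.0], 5))
```

Zero-padding every input to the same length keeps the size of the convolutional output, and hence the number of adaptation-layer weights, independent of t, which is the motivation given in the text.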

Network Training
The Θ was trained by minimizing the cross-entropy loss and the $L_2$-regularization of the weights of the adaptation layers θ on a given training/validation dataset using mini-batch gradient descent with a batch size of 64, where $y_i$ represents a one-hot-encoded label of a given P and λ is the strength of the $L_2$-regularization for θ to avoid overfitting. The λ and learning rate η were selected using the random search method, which determines the optimal hyperparameters from randomly sampled values [66]. Each hyperparameter follows the uniform distribution on the log-space in the range of $[10^{-4}, 10^{2}]$. This search method reduces the effort for hyperparameter searching compared to the grid search method [67] and returns nearly optimal values. In this study, an optimal parameter pair of a given target domain $D_T = \{P^i_k \in \mathcal{X}_T, \mathcal{Y}_T\}$ is obtained via five-fold cross-validation, where i and k denote the category and the number of data in the corresponding category, respectively. This can be realized as follows. (1) Fifty hyperparameter pairs are generated. (2) The hyperparameter pair with the maximum mean validation accuracy on the five training/validation folds is selected after 10 epochs of training. (3) Five predictive functions $f_{\Theta^*_j}(\cdot)$, $(j = 1, \cdots, 5)$, are obtained through 50 epochs of training on the five training/validation folds, where $\Theta^*_j$ represents the weights of the whole network with the highest validation accuracy against the jth training/validation fold.
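The hyperparameter sampling and the training objective described above can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' code: the objective is written as cross-entropy plus λ times the sum of squared adaptation-layer weights (the exact scaling of the regularization term is not recoverable from the text), and `mean_val_accuracy` is a hypothetical callable standing in for 10 epochs of training on each of the five folds.

```python
import math
import random


def sample_log_uniform(rng, low_exp=-4.0, high_exp=2.0):
    """Draw a value uniformly in log-space from [10^low_exp, 10^high_exp]."""
    return 10.0 ** rng.uniform(low_exp, high_exp)


def sample_pairs(n_pairs=50, seed=0):
    """Generate (lambda, learning-rate) candidates for random search."""
    rng = random.Random(seed)
    return [(sample_log_uniform(rng), sample_log_uniform(rng))
            for _ in range(n_pairs)]


def regularized_loss(probs, one_hot, adaptation_weights, lam):
    """Cross-entropy plus an L2 penalty on adaptation weights (assumed form)."""
    cross_entropy = -sum(y * math.log(p)
                         for y, p in zip(one_hot, probs) if y > 0)
    l2_penalty = sum(w * w for w in adaptation_weights)
    return cross_entropy + lam * l2_penalty


def select_best(pairs, mean_val_accuracy):
    """Pick the pair with the highest mean validation accuracy over folds.

    `mean_val_accuracy` is a hypothetical function (lam, eta) -> float.
    """
    return max(pairs, key=lambda pair: mean_val_accuracy(*pair))


if __name__ == "__main__":
    candidates = sample_pairs()
    print(len(candidates), candidates[0])
    print(regularized_loss([0.5, 0.5], [1, 0], [1.0, 2.0], lam=0.1))
```

Sampling in log-space spreads the candidates evenly across the six orders of magnitude of the search range, whereas uniform sampling in linear space would place almost all candidates above 1.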

Inter-Floor Noise Source Type Classification and Localization Tasks
Several source type classification and localization tasks are prepared to verify the CNN-based inter-floor noise classification and identify its limitations. These tasks verify the method via source type classification and localization in two apartment buildings, and knowledge transfer between two actual apartment buildings. The prepared tasks are presented in Tables 1 and 2. Table 1 presents tasks and datasets for training/validation and testing predictive functions. The notation of the task names in the first column is represented as $T_{\text{task type, training/validation dataset}}$ or $T_{\text{task type, test dataset}|\text{training/validation dataset}}$. Table 2 presents tasks prepared for verifying the knowledge transfer between the two apartment buildings. These tasks test the source type classifiers and locators using the inter-floor noise obtained from the other apartment building. The notation of the task name is represented as $T_{\text{task type, test dataset}|\text{training/validation dataset}}$.

Source Type Classification in a Single Apartment Building
The source type classification tasks summarized in Table 1 are described in the following (a)-(c). These tasks are prepared using the datasets from the two apartment buildings, APT I and APT II, to verify the CNN-based source type classification of inter-floor noise in a single apartment building.
(a) T t,APT 1 . This task cross-validates the source type classification with the inter-floor noise on the floors above/below in APT I. This is realized by finding predictive functions f Θ j t,APT 1 (·) with a label space Y t, APT 1 on five folds (j = 1, · · · , 5) of labeled training/validation data pairs {P i k , y i } from D APT 1 .
(b) T t,APT 1 |APT 1 . This task verifies the source type classification against the inter-floor noise generated at unlearned positions in the same apartment building, APT I. f Θ j t,APT 1 (·) obtained in T t,APT 1 is tested against the test data pairs {P i k , y i } from D APT 1 .
(c) T t,APT 2 and T t,APT 2 |APT 2 . These tasks verify the same things against the noise samples on the floors above/below in APT II. f Θ j t,APT 2 (·), (j = 1, · · · , 5) with Y t,APT 2 are obtained and tested against {P i k , y i }, y i ∈ Y t,APT 2 from D APT 2 .

Localization in a Single Apartment Building
Localization tasks presented in Table 1 are described in the following (a)-(c). These tasks are prepared to verify the localization of inter-floor noise in a single building (APT I and APT II). In addition, localization of inter-floor noise from the same XY position but transmitted through different floor sections (different Z position) is verified.
(a) T p,APT 1 . This task cross-validates locators against the inter-floor noise on the floors above/below in APT I. This is realized by finding f Θ j p,APT 1 (·) with Y p,APT 1 on five folds (j = 1, · · · , 5) of labeled training/validation data pairs {P i k , y i }, y i ∈ Y t,APT 1 from D APT 1 .
(b) T p,APT 1 |APT 1 . f Θ j p,APT 1 (·) obtained in T p,APT 1 are tested against the inter-floor noises generated at the unlearned positions in the same apartment building, APT I. The predictive functions are tested against the test data pairs {P i k , y i }, y i ∈ Y t,APT 1 from D APT 1 .
(c) T p,APT 2 and T p,APT 2 |APT 2 . These tasks verify locators using the same approach as T p, APT 1 and T p,APT 1 |APT 1 , with the inter-floor noise obtained from APT II.

Knowledge Transfer between the Apartment Buildings
If the trained source type and localization knowledge can be reused for inter-floor noise identification tasks in another, similar apartment building, either without further training or with training on only sparse data from that building, then this method can be used widely. The knowledge transfer tasks presented in Table 2

Performance Evaluation
The F1 score in Equation (12), computed with Scikit-learn [68], is adopted to measure the performance of the predictive functions. TP, FP, and FN represent true positive, false positive, and false negative, respectively. This metric evaluates the classification results while accounting for the imbalance of the inter-floor noise dataset.

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (12)$$

where $\mathrm{Precision} = \frac{TP}{TP + FP}$ and $\mathrm{Recall} = \frac{TP}{TP + FN}$.

Table 3 presents the F1 scores of the source type classification results as t varies. The F1 scores obtained with HH and HD treated as the same category are placed in parentheses to distinguish them from the results of the original task. As shown by the differences between the F1 scores in the two representations, f Θ j t,APT 1 (·) and f Θ j t,APT 2 (·) confuse HH and HD in all tasks. Figure 5 shows the time-frequency representations of HD and HH at the campus building (Figure 5a,d), APT I (Figure 5b,e), and APT II (Figure 5c,f). HD at the campus building shows repeated impact noise patterns (peaks) induced by the hammer bouncing after it was dropped on terrazzo tile flooring. However, HD in both apartment buildings was generated with a hammer on slabs covered with vinyl flooring, which prevented the hammer from bouncing and mitigated the peak patterns. Hence, the source type classification results obtained by treating HH and HD as the same category need to be considered additionally.

Figure 6a presents the trend of the F1 scores of the source type classification results as t varies. This plot visualizes the results of the test tasks (T t,APT 1 |APT 1 and T t,APT 2 |APT 2 ). The F1 scores of the source type predictive functions on inter-floor noise from the unlearned positions are lower than those of the cross-validation tasks (T t,APT 1 and T t,APT 2 ). The SoundNet-based classifiers underperformed in the source type classification tasks and exhibited significantly low F1 scores at t = 0.152 and 0.501 s. Additionally, varying t produced similar variations in the F1 scores of the VGG16-based classification.
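The effect of treating HH and HD as one category on the F1 score can be illustrated with a small sketch. The study computes F1 with Scikit-learn; the pure-Python version below is a stand-in written for self-containment, and it assumes macro averaging over categories, which the text does not specify. The label `other` and the example predictions are hypothetical.

```python
def f1_per_class(y_true, y_pred):
    """Per-category F1 computed from TP, FP, and FN counts."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2.0 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-category F1 (averaging mode is an assumption)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def merge_labels(labels, group=("HH", "HD"), merged="HH/HD"):
    """Relabel every member of `group` as one merged category."""
    return [merged if label in group else label for label in labels]


# Hypothetical predictions in which HH and HD are partly confused.
y_true = ["HH", "HD", "HH", "HD", "other", "other"]
y_pred = ["HD", "HH", "HH", "HD", "other", "other"]

original_score = macro_f1(y_true, y_pred)
merged_score = macro_f1(merge_labels(y_true), merge_labels(y_pred))
```

In this toy case every error is an HH/HD confusion, so merging the two categories removes all errors; the parenthesized scores in Table 3 follow the same relabeling idea.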

Source Type Classification Results in a Single Apartment Building
In summary, the CNN-based approach demonstrated the feasibility and generalizability of source type classification for inter-floor noise from actual reinforced concrete apartment buildings. The adapted two-dimensional CNN exhibited marginally better performance than the adapted one-dimensional CNN. This may be explained by the sharable low-level notions in the pre-trained knowledge. In addition, an appropriate selection of t improves the F1 scores of both the cross-validation tasks and the test tasks against inter-floor noise transmitted through the unlearned floor sections.

Localization Results in a Single Apartment Building
The first eight rows of Table 4 present the F1 scores of the localization results obtained via five-fold cross-validation, T p,APT 1 and T p,APT 2 , as t varies. The underlined values represent the F1 scores of the floor classification, obtained by rearranging the localization results to their corresponding floors. Because 95.7% of the actual inter-floor noise complaints were identified as noise from the floors above/below [4], these floor classifications are considered to be the main interest in real applications. Figure 6b,c present the trends of the localization results and of the floor classification tasks as t varies, respectively.
The remaining eight rows present F1 scores of test results via T p,APT 1 |APT 1 and T p,APT 2 |APT 2 . T p,APT 1 |APT 1 tests f Θ j p,APT 1 (·) on {P i k , y i }, y i ∈ Y p,APT 1 . However, there exists a label-space difference between f Θ j p,APT 1 (·) (i.e., Y p,APT 1 ) and Y p,APT 1 . Accordingly, 1-C-a, 1-D-a, and 1-E-a cannot be considered any of the categories learned by f Θ j p,APT 1 (·) because their XY positions relative to the receiver are different from those learned by f Θ j p,APT 1 (·). Therefore, the localization test results are evaluated separately, as follows.
(a) The position labels 1-A -a and 1-B -a are considered 1-A-a and 1-B-a, respectively. These realize the localization of inter-floor noise transmitted through the unlearned floor section (4 F → 3 F in APT I).
(b) Localization results of 1-C-a, 1-D-a and 1-E-a are squeezed into y i ∈ Y p,APT 1 and approximated to floor classification because their XY positions cannot be mapped to y i ∈ Y p,APT 1 directly.
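The rearrangement of localization results to their corresponding floors can be sketched as a simple relabeling step. The sketch assumes that the last field of a position label encodes the floor, with `a` for the floor above and `b` for the floor below; this reading of the label layout, and the example predictions, are assumptions made for illustration only.

```python
# Assumed meaning of the final label field (not confirmed by the text).
FLOOR_OF_CODE = {"a": "above", "b": "below"}


def to_floor(position_label):
    """Map a position label such as '1-A-a' to its floor category."""
    floor_code = position_label.rsplit("-", 1)[-1]
    return FLOOR_OF_CODE[floor_code]


def accuracy(y_true, y_pred):
    """Fraction of matching entries in two equally long label lists."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical locator outputs: one XY error and one floor error.
true_positions = ["1-A-a", "1-B-a", "1-A-b"]
pred_positions = ["1-B-a", "1-B-a", "1-A-a"]

position_accuracy = accuracy(true_positions, pred_positions)
floor_accuracy = accuracy(
    [to_floor(label) for label in true_positions],
    [to_floor(label) for label in pred_positions],
)
```

An XY error on the correct floor no longer counts as an error after the rearrangement, which is why floor-level scores can exceed position-level scores.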
If the noise sources with the same XY positions (e.g., 2-A-a and 2-A -a) are assumed to be the same category, {P i k , y i }, y i ∈ Y p,APT 2 for T p,APT 2 |APT 2 can be mapped to y i ∈ Y p,APT 2 by f Θ j

In summary, the CNN-based locator demonstrated the feasibility and generalizability of floor classification for inter-floor noise generated on the floors above/below and transmitted through the learned/unlearned floor sections in actual apartment buildings. It can help minimize the human effort of data gathering for the floor classification problem. However, localization of inter-floor noise was available only for noise transmitted through the learned floor sections. The adapted one-dimensional CNN-based locator with t = 0.152 and 0.501 s exhibited significantly low F1 scores.

Table 5 presents the evaluated performance of the knowledge transfer tasks as t varies, as explained in Section 3.4.3. The first eight rows present the results of the source type knowledge transfer tasks T t,APT II|APT 1 and T t,APT I|APT 2 . Their performance was evaluated with F1 scores; the values in parentheses are the F1 scores when HH and HD are assumed to be the same category. The source type knowledge transfer results showed lower F1 scores than those of the cross-validation and test tasks. This performance degradation may originate from the domain difference. The underlined values in the remaining rows present the F1 scores of the localization knowledge transfer results of T p,APT II|APT 1 and T p,APT I|APT 2 . They were obtained by rearranging the localization results to their corresponding floors. As illustrated in Figure 6d, these F1 scores degraded as t increased. This may be explained by the difference in reverberation characteristics between the inter-floor noise obtained in the two apartment buildings.

Therefore, when the target environment (apartment building) is changed, using a short and early part of the inter-floor noise is effective for localization with the source domain knowledge. Although the F1 scores of these results are lower than those of the cross-validation and test tasks, they are considered meaningful because they are better than the chance level and demonstrate the feasibility of source type and localization knowledge transfer between the actual apartment buildings. The performance degradation is common for the one-dimensional CNN-based locator with t = 0.152 and 0.501 s.
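Taking the short and early part of an event signal amounts to keeping the first t seconds after the detected event onset. The sketch below is one possible way to do this; the function name, the `onset_index` argument, the sampling rate, and the zero-padding of events shorter than t are all assumptions made for illustration.

```python
def early_segment(signal, sample_rate_hz, t_seconds, onset_index=0):
    """Return the first t_seconds of `signal` starting at `onset_index`.

    The result always has round(t_seconds * sample_rate_hz) samples; events
    that end early are zero-padded (a choice made for this sketch).
    """
    n_samples = int(round(t_seconds * sample_rate_hz))
    segment = list(signal[onset_index:onset_index + n_samples])
    segment += [0.0] * (n_samples - len(segment))
    return segment


# Hypothetical event: 100 samples at an assumed 10 Hz sampling rate.
event = [float(i) for i in range(100)]
short_input = early_segment(event, sample_rate_hz=10, t_seconds=0.5)
long_input = early_segment(event, sample_rate_hz=10, t_seconds=2.0)
```

A shorter t keeps mostly the direct arrival of the impact and less of the building-dependent reverberant tail, which matches the explanation given for the localization transfer results.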

Results of Knowledge Transfer between the Apartment Buildings
The feasibility of the knowledge transfer can be interpreted as the CNN extracting generalized feature representations of source types and positions from inter-floor noises. To visualize these generalized features, inter-floor noise filtered by the learned one-dimensional CNNs with t = 1.00 s for source type classification and localization was embedded in a two-dimensional space using a dimension reduction algorithm, t-distributed stochastic neighbor embedding (t-SNE) [69]. Because the dimension of fc2 depends on the number of categories (n), the outputs from fc1 were adopted in this case. Figures 7 and 8 present the t-SNE of the inter-floor noises filtered by the CNNs. Figure 7a,b illustrate the t-SNE of source type features in T t,APT II|APT 1 and T t,APT I|APT 2 , respectively. The t-SNE of the same source types within the same apartment building are categorized into groups. In particular, the t-SNE of the same source types in different apartment buildings are clustered around these groups. This demonstrates that the CNN-based feature extraction can build generalized source type representations of the inter-floor noises. Figure 8 illustrates the t-SNE of floor features in T p,APT II|APT 1 and T p,APT I|APT 2 . The t-SNE of fc1 are categorized into groups of the floors above and below. However, the groups are not as well clustered as the t-SNE of the source type features. It can be inferred that although the extraction of generalized floor knowledge using a CNN is feasible, it is not significantly effective. The results in Table 5 also exhibit the corresponding values.

Figure 6. Results of inter-floor classification and classification knowledge-transfer tasks evaluated using F1 score. HD and HH were considered the same category for source type classification results. (a) Cross-validation and test results of the source type classification tasks in Table 3. (b) Cross-validation and test results of localization tasks in Table 4. (c) Rearranged localization results to their corresponding floor. (d) Results of the knowledge transfer tasks in Table 5.

Figure 8. The symbols in purple and those in yellow represent t-SNE of learned and unlearned signals, respectively, for a single selected floor among the floors above and below. In addition, those in gray represent the other floor. (a,b), respectively, colorize signals from the floor above and below in T p,APT II|APT 1 . (c,d), respectively, colorize signals from the floor above and below in T p,APT I|APT 2 .

Input Signal Length Selection
It is recommended to fix t when implementing the method for real-world applications. The SoundNet-based predictive functions exhibited lower performance in all inter-floor noise identification tasks except T p,APT 2 , as well as significantly low F1 scores at t = 0.152 and 0.501 s. Therefore, only the VGG16-based predictive functions are considered in this discussion. The source type classifiers and locators with t = 1.50 and 2.00 s exhibited the best F1 scores most frequently. Additionally, the locators with t = 0.501 s showed clear effectiveness for localization knowledge transfer.

Conclusions and Future Study
In this study, the generalizability of the CNN-based supervised learning method for source type classification and localization of inter-floor noise was demonstrated via several designed tasks. Furthermore, the feasibility of source type and localization knowledge transfer between apartment buildings was demonstrated. These were demonstrated using inter-floor noise datasets obtained from two actual apartment buildings.
The source type classifiers and locators consist of CNN-based feature extractors followed by adaptation layers and a softmax function. Weights pre-trained on large annotated image datasets were used to initialize the feature extractors. Noise events in the signals were detected using the HOS-based algorithm and input to one-dimensional (SoundNet) and two-dimensional (VGG16) CNNs. The signal length in time t was selected empirically.
The source type classifiers and locators were verified against several tasks: five-fold cross-validation on inter-floor noise transmitted through a learned floor section in each of the two apartment buildings, tests on noise transmitted through an unlearned floor section in the same apartment building, and tests on noise obtained from an unseen apartment building. The performance of the method for each task was evaluated using the F1 score. The VGG16-based source type classifier and locator performed better than the SoundNet-based ones. For cross-validation of the source type classification with t = 2.00 s, the VGG16-based source type classifier showed F1 scores of 0.9731 and 0.9551 for the APT I and APT II datasets, respectively. If HD and HH are considered the same category, these scores improved to 0.9798 and 0.9953. For the test tasks using inter-floor noises transmitted from the unlearned source positions, the F1 scores of the source type classification results by the same classifiers dropped to 0.8303 and 0.7991, respectively. If HD and HH are considered the same category, the scores reached 0.9563 and 0.9875. For cross-validation of the localization, the VGG16-based locator with t = 2.00 s showed F1 scores of 0.9574 and 0.9272 for the APT I and APT II datasets, respectively. Rearranging these results to their corresponding floors changed the F1 scores to 0.9955 and 0.9786, respectively. The method showed limitations in the test tasks using noise signals transmitted through the unlearned floor sections. However, the results rearranged to the corresponding floor classification dropped by less than 2% from those of the cross-validation tasks. These results present the generalizability of the identification method with a single sensor in an actual multi-storey reinforced concrete building and the feasibility of the knowledge transfer between similar buildings.
In conclusion, the results of this study contribute to identifying inter-floor noise in real multi-storey apartment buildings and to other single sensor-based approaches. Future studies should focus on resolving the limitations of the method by using the geometrical information of the buildings for localization of signals at unlearned positions, or by using domain adaptation techniques for better knowledge transfer. In addition, although the CNN-based approaches showed their generalizability, the datasets are only suitable for the designed tasks. To deal with tasks with a high degree of freedom, the data are expected to be updated to cover many different scenarios for better generalization, e.g., changes in the receiver's position and orientation.

Acknowledgments:
The authors thank Sangkyum An and Minseuk Park for providing the experimental sites.

Conflicts of Interest:
The authors declare no conflict of interest.