Comparative Study of Dimensionality Reduction Techniques for Spectral–Temporal Data

: This paper studies the use of three different approaches to reduce the dimensionality of a type of spectral–temporal features, called motion picture expert group (MPEG)-7 audio signature descriptors (ASD). The studied approaches include principal component analysis (PCA), independent component analysis (ICA), and factor analysis (FA). These approaches are applied to ASD features obtained from audio items with or without distortion. These low-dimensional features are used as queries to a dataset containing low-dimensional features extracted from undistorted items. Doing so, we may investigate the distortion-resistant capability of each approach. The experimental results show that features obtained by the ICA or FA reduction approaches have higher identiﬁcation accuracy than the PCA approach for moderately distorted items. Therefore, to extract features from distorted items, ICA or FA approaches should also be considered in addition to the PCA approach.


Introduction
It is very useful for many applications to reduce the dimensionality of data for higher processing speed, if the dimension-reduced data, also called features, can faithfully preserve important information from the original data. For example, in the feature-based object recognition problem [1], one important step, known as the feature extraction step, is to find a set of low-dimensional features to represent the underlying high-dimensional image. Doing so can significantly reduce the recognition time as well as storage space.
Another application of performing dimensionality reduction is to increase the processing speed. For example, a music identification system can significantly speed up the identification speed with the multi-resolution approach [2,3]. In this approach, a set of low-dimensional features is derived from the higher-dimensional features of a query audio clip. Based on the low-dimensional features, a small set of candidate objects are identified. Next, high-dimensional features are used to pick the best match among the chosen candidates. By comparing only a small set of candidates with high-dimensional features, the computational cost is dramatically reduced. However, to maintain high recognition accuracy, the low-dimensional features should have sufficient discriminant power. Thus, it is important to retain more useful information while performing dimensionality reduction.
In many applications, we use the short-time Fourier transform (or other time-frequency transformations) to observe the frequency characteristics of the signal along time [4] for a piece of signal, such as speech, music, or seismic wave. The computed coefficients, known as spectrogram, are a type of spectral-temporal features. Usually, this type of feature is arranged in a block (matrix) form, with one axis representing time and the other axis frequency. In this paper, the spectral-temporal features under consideration are the ISO's (International Organization for Standardization) MPEG-7 (motion picture expert group) audio signature descriptors (ASD), which are shown to be effective for music identification (also to be covered in Section 2) [2,3,5]. The independent component analysis (ICA) [6,7] has been shown to effectively solve the cocktail-party problem, where several speech signals are mixed together. Recently, it is also used as a processing step in the recognition of EEG (Electroencephalography) waves [8,9]. Due to the underlying algorithm, components decomposed with ICA have the sign, scale, and permutation ambiguities. Therefore, we are unable to know which independent component has higher energy. With the scale ambiguity, it is not possible to choose the ICA-decomposed components based on component energy to perform dimensionality reduction. Therefore, we suggest an approach to walk around this difficulty and show the experimental results in this paper.
For many instances, we would like to investigate the factors affecting the observation. While this type of analysis can be carried out by the careful design of experiments, it can also be accomplished statistically by using factor analysis (FA). According to Reference [10], " . . . factor analysis is to describe the covariance relationships, if possible, among many variables in terms of a few underlying, but unobservable, random quantities called factors". FA has many applications, such as anomaly detection [11,12]. Despite its popularity, only a few papers study the use of this method as a general dimensionality-reduction technique. In this paper, we will present our results of using FA as a reduction method and show the relative strength and weakness of this approach.
Although a complete survey of available dimension-reduction methods is available in the literature [13], unfortunately, the survey does not contain any quantitative comparison among the approaches. Previously, we have studied some dimensionality-reduction methods for MPEG-7 ASD, such as PCA (principal component analysis) [2,3]. In these papers, we use PCA, but not ICA or FA, to decompose the feature matrix, and then keep large components and discard smaller ones. Unfortunately, if an audio item is distorted, such as an air-recorded item, the distortion may also be present in large components. Therefore, the dimension-reduced features cannot faithfully preserve important information in the feature matrix. Previously, we have applied ICA for this problem and the results show that the ICA has better accuracy [14]. In this paper, we extend the work of Reference [14] to include FA in the comparison and to provide a detailed explanation for each of the compared approaches. To the best of the author's knowledge, we have not read any paper conducting experiments with a similar comparison. In addition, although the experiments apply reduction techniques only to the MPEG-7 ASD, the presented approaches, nevertheless, are general approaches and can be easily applied to other types of features and application scenarios. This paper is organized as follows: Section 2 covers the basics of the MPEG-7 features, which will be used in the experiments as the full-dimensional data. Section 3 describes how to perform dimensionality reduction with the PCA, ICA, and FA methods. Section 4 covers experiments and results. Finally, Section 5 is the conclusion.

Music Identification
Music identification [2,3,15] is a technique to identify an audio clip from a large set of soundtracks based on the signal waveform. Therefore, even if the same person sings the same songs two times, these two audio works are considered as different. In terms of copyright protection, this distinction is actually an advantage, as there are many cover songs available in the market. Other applications of music identification also include query-by-exact example, broadcasting program monitoring, and so on [2].
Although the waveforms of different soundtracks are different, it is not possible to perform a comparison based on PCM (pulse code modulation) samples in the soundtracks because the computational cost is way too high [2]. Therefore, instead, we perform comparison based on some unique features extracted from the waveform. There are many proprietary schemes available to extract the features, as Shazam [16]. Other than the proprietary fingerprinting schemes, the ISO's MPEG-7 audio standard also provides high-level tools to compute audio features suitable for music identification, known as audio signature tool. This tool computes ASD for each segment of the audio waveform. It is shown that the descriptors have satisfactory accuracy when used for music identification [2,3]. Unfortunately, as the dataset contains such a huge number of soundtracks, directly using the ASD would still require a lot of computation.
To cope with the high computational complexity, multi-resolution detection has been proposed [2], as shown in Figure 1. Firstly, low-resolution features are used to find a list of possible candidates. Second, the full-resolution features (fingerprints) are used only to compare the candidates in the list. By doing so, the computational cost can be greatly reduced.

Overview of MPEG-7 Audio Signature Descriptors
The MPEG-7 audio standard has many tools for audio feature extraction, audio identification, and audio recognition [5]. Some of the audio features are directly derived from the waveform of the audio signal. These low-level features can be used standalone or to construct high-level descriptors. The high-level descriptors may be used for more complicated tasks such as musical instrument recognition. In this paper, we use the ASD, one of the high-level descriptors, as the full-dimensional features to be reduced. To be complete, the computational steps of the ASD computation are given below. After the above computational steps, the obtained ASD can be represented in a matrix form, as depicted in (1), for a piece of 14-s audio. In the matrix, the horizontal direction represents frequency and the vertical direction represents time:

Overview of 2-D PCA Computation
PCA has been widely used for reducing the dimensionality of features for many years [11]. In this paper, we have features arranged in a 2-D array form, not in a vector form. Therefore, we need to conduct the PCA reduction two times: first time in the vertical direction and the second time in the horizontal direction. For illustration purposes, we use the entire feature matrix to explain the computation steps. Suppose that the training dataset contains A (1) , · · · , A (N) feature matrices. To conduct PCA, we follow [17] and treat each column of A (k) as one measurement from a trail. By combining all A (k) together, we have the following matrix: In this representation, each column vector b k is one vector subject to be mapped to a lower-dimensional space. The PCA computational steps are as follows: (a) Remove means from the dataset. Let b be the vector of average for all b k . We compute: to remove means. Note that it is not needed to make unity variances for elements in x k . (b) Construct X matrix as: (c) Decompose XX T matrix as: where T denotes transpose, Λ is a diagonal matrix containing eigenvalues of XX T arranged from large to small, and: is an orthonormal matrix that contains the corresponding eigenvectors with p = 29 (d) Use only p = 3 in (6) to perform reduction and denote the resultant matrix asṼ.
Then, compute dimension-reduced matrix R, size of 24N × 3, from full-dimensional features X by: (e) After this step, the reduction in row direction of A (k) is complete. The next step is to perform a reduction in the horizontal direction. To do so, we treat R T as X in (5) and repeat the computational steps one more time to complete the 2-D reduction.
In our previous paper, we have shown that a variation of 2-D PCA is slightly better. Thus, in the experiments, we use a slightly different method to compute the lowdimensional features [3].

Introduction to ICA
The basic ICA algorithm consists of three main steps: centering, whiting, and maximizing non-Gaussianity [6,7]. Suppose that each a k in (1) is one vector subject to ICA, then the ICA is computed as follows.
(a) Centering: Centering is to remove the average from the received samples. This step is equal to (3) in PCA. (b) Whitening: This step is used to remove the correlation between components through the covariance matrix. Its computational steps are similar to that of the PCA except a scaling. Specifically, we use: to compute the diagonal matrix D containing eigenvalues. Then: is the resultant vector after the whiting step. It can be easily verified that if z k is an observation (outcome) of a random vector z, then E zz T = I, where E{·} is the expectation operation [6]. Note that (9) is used only to compute whitened measurements z k . Further steps are necessary to obtain the transformation matrix to compute the final results. (c) Maximizing non-Gaussianity: Since we assume that the components of interests are not Gaussian distributed, it is natural to maximize non-Gaussian components. There are two widely used methods to measure the level of non-Gaussian, namely, kurtosis and negentropy. Please note that maximizing non-Gaussianity is not the only approach to perform ICA. Another widely used criterion is minimizing mutual information [7].
The kurtosis of a random variable x is computed as: where µ is the mean value of x. According to (10), a Gaussian random variable has a kurtosis of zero. Therefore, the procedure of maximizing non-Gaussian is to maximize the absolute value of kurtosis. Specifically, let: where the vector w is a transformation vector. Note that we use the random vector z in (11) because (10) involves the expectation operation. The objective function in this case is: The problem now becomes maximizing J 1 (y) with respect to a vector w with w = 1 for all observations of z, totally 24N vectors. If more than one component is needed, we can compute multiple w vectors to form a transformation matrix W. This constraint optimization problem can be solved numerically, say, by fixed-point iteration.
The entropy of a random variable x with a density function f (x) is computed as: It can be shown that the Gaussian distribution has the highest entropy among all distributions. Based on (13), we define the differential entropy, or negentropy, as: where x G is a Gaussian random variable with the same variance as x. The goal of maximizing non-Gaussian can then be achieved by maximizing the negentropy. To use this concept, we let the objective function be J 2 (y), where y is given in (11), and want to maximize J 2 (y) with respect to the vector w. In the practical implementation, (14) is difficult to compute. Therefore, we use an approximate equation, such as: where v is zero-mean, uni-variance Gaussian random variable and G is a function chosen from some candidate functions, such as: where α is a constant. The optimization problem in (15) can also be numerically solved by the fixed-point iteration algorithm.

Proposed ICA Reduction
As mentioned previously, the components obtained by the ICA has scaling ambiguity. Therefore, we are unable to directly use the ICA components with large energy as the dimension-reduced features. After second thoughts, we decide to reduce the dimensionality in the whitening step, detailed as follows: 1.
Perform the centering and whitening steps. Simply following (8), we have: 2.
Keep only eight components in V to reduce the dimensionality. As before, the eigenvalues in (8) are arranged from large to small. Thus, we have a 29 × 8 matrix: 3. The next step is supposed to compute the whitening version of x k . However, we are unable to useṼ in the VD − 1 2 V T term becauseṼ is not a square matrix. To overcome this problem, we follow (24) in the proposed FA approach (to be discussed later), which in the present form is:z where D is a 8 × 8 diagonal matrix containing the eigenvalues associated with v j in (18), andz k has a size of 8 × 1. Recall that there are 24 x k vectors derived from one feature matrix A.

4.
Perform the maximizing non-Gaussianity step. In this step, eight w vectors are obtained to form a transformation matrix W (size of 8 × 8). Then, compute:

5.
Treat y k as x k and repeating the whitening procedure again. However, this time we keep only one vector in (18) to compute the final low-dimensional features.
The sizes of matrices involved in the proposed ICA approach are given in Figure 2. TheV matrix is computed from (19) and the W matrix is from non-Gaussian step. Recall that the computation ofV is closely related to the PCA approach. Therefore, the ICA reduction is an extension of the PCA approach, with the maximum non-Gaussian step in between two PCA computations. The sizes of matrices involved in the proposed ICA approach are given in Figure 2. The matrix is computed from (19) and the matrix is from non-Gaussian step. Recall that the computation of is closely related to the PCA approach. Therefore, the ICA reduction is an extension of the PCA approach, with the maximum non-Gaussian step in between two PCA computations.

Introduction to Factor Analysis
Factor analysis includes exploratory factor analysis and confirmatory factor analysis [18,19]. In this paper, we only consider exploratory factor analysis [20]. In addition, there are many different types of factoring, such as principal component analysis and canonical factor analysis [12]. We use the principal component analysis method in this paper. This type of factor analysis is also called principal component factor analysis [12].
Suppose that there are m observable random variables, to , with respective mean values of ̅ to ̅ under investigation. Furthermore, suppose that there are unknown constants ℓ , with unobservable random variables with i from 1 to m and j from 1 to n, where n is less than m. We then can have the following expression: where is random noise. Sometimes, we write (21) in short form as: It is a common practice to assume that and are independent, ̅ = 0, and the covariance matrix of is the identity matrix. In addition, all random variables in are independent with finite variances. To find any solution to (22), we add additional constraints to obtain [13]: where Cov(•) denotes the covariance matrix and Η is the covariance matrix of . As contains independent random variables, Η is a diagonal matrix. The computed is called the factors and L is called the loading matrix.

Introduction to Factor Analysis
Factor analysis includes exploratory factor analysis and confirmatory factor analysis [18,19]. In this paper, we only consider exploratory factor analysis [20]. In addition, there are many different types of factoring, such as principal component analysis and canonical factor analysis [12]. We use the principal component analysis method in this paper. This type of factor analysis is also called principal component factor analysis [12].
Suppose that there are m observable random variables, x 1 to x m , with respective mean values of x 1 to x m under investigation. Furthermore, suppose that there are unknown constants i,j with unobservable random variables e j with i from 1 to m and j from 1 to n, where n is less than m. We then can have the following expression: where ζ j is random noise. Sometimes, we write (21) in short form as: It is a common practice to assume that e and ζ are independent, e j = 0, and the covariance matrix of e is the identity matrix. In addition, all random variables in ζ are independent with finite variances. To find any solution to (22), we add additional constraints to obtain [13]: Cov(x) = LL T + H, where Cov(•) denotes the covariance matrix and H is the covariance matrix of ζ. As ζ contains independent random variables, H is a diagonal matrix. The computed e is called the factors and L is called the loading matrix.

Proposed FA Reduction
To use FA, we need to find the loading matrix L. By using the principal component factor analysis [8], the steps to compute the proposed FA reduction approach are given below.
(a) Follow the ICA computation in (17) to obtain V matrix, and then choose p components to construct V p matrix. (b) Construct the loading matrixL by: In the simulation, we use p = 3. (c) Compute the factor scores for one vector x k by: Then, the obtained f k is a dimension-reduced vector in one direction (i.e., time information is lost). (d) To further reduce the dimensionality of the features in another direction, we perform the same steps a second time by treating f k as x k .
When completing the procedure, we have an 3 × 3 matrix as the factor scores for one feature matrix A. In the simulation, we discard the element of (3, 3) and use the rest of 8 elements.

Experimental Settings
The experiments were carried out by computer simulations. To do so, we collected 750 soundtracks with various genres from different CD titles. The genres of the soundtracks included classical music, piano solo, soft music, pop music, and so on. The duration of each excerpted soundtrack was 30 s. These soundtracks served as the reference (database) items. Next, testing (or query) items with a duration of 15 s were excerpted from the reference items with random starting points. We used random starting points to simulate the actual recording situation. These items were referred to as the original testing items. The original testing items were subject to some widely encountered distortion attacks, such as audio compression and recording, to be described below.
To evaluate the detection accuracy after audio compression, we encoded the original test items with MP-3 (MPEG-1 layer 3) [21] in 192 kbps (kilo bit per second) and 96 kbps bitrates, respectively. As the original test items were stereo audio, they were encoded in the joint stereo mode. Then, the compressed items were decoded to PCM items. These items were downmixed from stereo to mono before testing.
To further investigate the identification accuracy under the actual recording environment, we also collected recorded items. Specifically, we played the original test items with a notebook and recorded the audio signal through two different paths, as shown in Figure 3. In the first path, the audio output port of the notebook is directly connected to the audio input port of a desktop computer. Thus, the signal-recording path contains a digital-to-analog converter, an analog-to-digital converter, and associated anti-aliasing and reconstruction filters. The obtained testing items are called wire-recorded items in the following. In the second path, the audio item is played back through a loudspeaker, and the acoustical signal is recorded by a microphone. The test items recorded with the microphone are referred to as the air-recorded items. Another set of testing items are obtained by adding −20 dB and −10 dB AWGN (additive white Gaussian noise) to the original testing items. The noise energy is relative to the signal energy. The reason that we used both wire and acoustic recordings is to study the impact of the distortion sources. For wire-recorded signals, the distortion is mainly due to the signal processing blocks inside the personal computer, such as the anti-aliasing filter and the analog-to-digital converter. Thus, the amount of distortion may be reasonably small. For the air-recorded items, the distortion sources include the loudspeaker distortion, the room impulse response, the ambient (environmental) noise, and possibly the microphone response. When compared with the wire-recording items, it is reasonable to guess that air-recorded items have much higher distortion. Before conducting the identification task, the reference items and all sets of testing items were converted to MPEG-7 ASD. To have the highest possible accuracy, the high-frequency bands (higher than 4 kHz) were removed before reduction [2,3]. These descriptors are referred to as full-dimensional (FD) features. Next, low-dimensional features were obtained from the FD features by using the PCA, ICA, and FA approaches mentioned in Section 3. Finally, the low-dimensional features were presented as queries to the database as the testing items. Recall that the database contains only undistorted reference items.
The identification of a query item is carried out based on similarity, which is the Euclidean distance between a reference item and the query item. From Figure 1 we know that whether a query item is inside or outside the database is determined by using FD features, therefore, all LD query items are assumed to be excerpted from the reference items. The performance measure of the reduction approaches is based on accuracy. A candidate list is constructed based on the first i-th shortest distances. For example, the 15-th list contains 15 song titles. If the test item is actually in the list, the item is called correctly identified. By doing so, we can compute the accuracy. Recall that the low-dimensional features are used only to find candidates in the multi-resolution approach, and full-dimensional features will be used to identify the test item among the candidates. Thus, practically it is more important to consider the accuracy in the, say, 15-th match than in the 1-st match. Furthermore, in the present application, it is not so much useful to compute the average ranking for a method.

Comparison between Kurtosis and Negentropy Criteria
Because there are two widely used criteria to maximizing non-Gaussianity in computing ICA, we would like to know if these two have comparable performance. We know that a query item is more difficult to identify if it has some distortion. To this end, we compute the low-dimensional query items using these methods for test items that undergone MP-3 compression. The accuracy between the kurtosis and the negentropy approaches is given in Table 1. In the table, the accuracy is computed as: where is the total number of test samples, and the number of test samples within the i-th shortest distances. The results show that the negentropy function is not as good as the kurtosis one, especially in 1-st match. Therefore, we will drop the negentro- The reason that we used both wire and acoustic recordings is to study the impact of the distortion sources. For wire-recorded signals, the distortion is mainly due to the signal processing blocks inside the personal computer, such as the anti-aliasing filter and the analog-to-digital converter. Thus, the amount of distortion may be reasonably small. For the air-recorded items, the distortion sources include the loudspeaker distortion, the room impulse response, the ambient (environmental) noise, and possibly the microphone response. When compared with the wire-recording items, it is reasonable to guess that air-recorded items have much higher distortion. Before conducting the identification task, the reference items and all sets of testing items were converted to MPEG-7 ASD. To have the highest possible accuracy, the highfrequency bands (higher than 4 kHz) were removed before reduction [2,3]. These descriptors are referred to as full-dimensional (FD) features. Next, low-dimensional features were obtained from the FD features by using the PCA, ICA, and FA approaches mentioned in Section 3. Finally, the low-dimensional features were presented as queries to the database as the testing items. Recall that the database contains only undistorted reference items.
The identification of a query item is carried out based on similarity, which is the Euclidean distance between a reference item and the query item. From Figure 1 we know that whether a query item is inside or outside the database is determined by using FD features, therefore, all LD query items are assumed to be excerpted from the reference items. The performance measure of the reduction approaches is based on accuracy. A candidate list is constructed based on the first i-th shortest distances. For example, the 15-th list contains 15 song titles. If the test item is actually in the list, the item is called correctly identified. By doing so, we can compute the accuracy. Recall that the low-dimensional features are used only to find candidates in the multi-resolution approach, and full-dimensional features will be used to identify the test item among the candidates. Thus, practically it is more important to consider the accuracy in the, say, 15-th match than in the 1-st match. Furthermore, in the present application, it is not so much useful to compute the average ranking for a method.

Comparison between Kurtosis and Negentropy Criteria
Because there are two widely used criteria to maximizing non-Gaussianity in computing ICA, we would like to know if these two have comparable performance. We know that a query item is more difficult to identify if it has some distortion. To this end, we compute the low-dimensional query items using these methods for test items that undergone MP-3 compression. The accuracy between the kurtosis and the negentropy approaches is given in Table 1. In the table, the accuracy is computed as: where N total is the total number of test samples, and N i−th the number of test samples within the i-th shortest distances. The results show that the negentropy function is not as good as the kurtosis one, especially in 1-st match. Therefore, we will drop the negentropy approach from further experiments and will only use the kurtosis criterion in the following experiments.

Comparison between Various Approaches
The identification accuracy for the examined reduction methods is presented in Table 2. In the table, the FD column means the query items use full-dimensional features (only removing high-frequency components, as mentioned previously). This accuracy determines the accuracy upper bound of the multi-resolution approach. Based on the accuracy in the FD column, we know that it is easy to identify MP-3 compressed items. For air-recorded or wire-recorded items, the accuracy may be marginally acceptable, depending on the applications. For a query item severely impaired by white noise (−10 dB), the MPEG-7 ASD does not provide enough identification capability. The results show that the PCA approach has a slight advantage if the test item is distortionless (original) or only lightly distorted (192 k MP3). On the other hand, if the distortion is moderate to heavy, the PCA approach does not perform well. Taking the air-recorded items as an example, the ICA-based method has an accuracy of 77.5% in the 15-th match, whereas the PCA approach has only 67.6% accuracy. The accuracy difference is also very large for items with −20 dB noise.
When comparing the ICA and FA approaches, we notice that both have comparable performance. For heavily distorted items (−10 dB AWGN), the ICA is slightly better. However, for such test items, both approaches have very poor performance. Therefore, practically speaking, either ICA or FA approach is suitable for moderately distorted items.
When examining the accuracy of wire-recorded items, we notice that the accuracy of FD features is not as high as expected, such as 99.5%. In contrast, the FD features have 100.0% accuracy for 96k MP-3 items. In a sense, it indicates that the signal processing blocks along with the wire recording path significantly affect the identification accuracy of MPEG-7 ASD. Therefore, this type of descriptors is not very robust to the wire-recording attack.
Although the air-recorded items suffer from many types of distortion, their FD accuracy is higher than expected (considering we use only the 1-st match). On the other hand, the studied approaches can only provide moderate robustness toward this type of distortion. Therefore, for this type of distortion, better reduction techniques need to be developed.

Discussion
When comparing the computational steps of PCA, ICA, and FA, we know that they are all based on PCA. However, ICA and FA do not directly use the eigenvector matrix V to compute dimension-reduced features. Instead, ICA has a procedure to maximize non-Gaussian in order to find the transformation matrix. Similarly, FA uses the pseudo inverse as the transformation matrix. Either method seems to be more robust to a certain level of distortion. Therefore, we conclude that directly using PCA may not be sufficient in terms of retaining important information for distorted items. Consequently, using ICA or FA reduction methods becomes a better solution.

Conclusions
This paper compares three methods for reducing the dimensionality of a type of spectral-temporal data, called MPEG-7 ASD. The simulation results show that the PCA reduction method cannot effectively preserve important components for distorted items. On the other hand, ICA and FA reduction approach can better preserve important information of moderately distorted items. In conclusion, to extract low-dimensional features from distorted spectral-temporal data, the ICA or FA-based approaches are better choices than the PCA approach.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to copyright considerations.

Conflicts of Interest:
The authors declare no conflict of interest.