Individual Violin Recognition Method Combining Tonal and Nontonal Features

Abstract: Individual recognition among instruments of the same type is a challenging problem that has rarely been investigated. In this study, the individual recognition of violins is explored. Based on the source–filter model, the spectrum can be divided into tonal content and nontonal content, which reflect the timbre from complementary aspects. The tonal/nontonal gammatone frequency cepstral coefficients (GFCC) are combined to describe the corresponding spectrum contents in this study. In the recognition system, the Gaussian mixture models–universal background model (GMM–UBM) is employed to parameterize the distribution of the combined features. In order to evaluate the recognition task of violin individuals, a solo dataset including 86 violins is developed in this study. Compared with other features, the combined features show better performance in both individual violin recognition and violin grade classification. Experimental results also show that the GMM–UBM outperforms the CNN, especially when the training data are limited. Finally, the effect of players on individual violin recognition is investigated.


Introduction
Musical instrument recognition is a process to identify the type of musical instrument from the audio, which can be achieved through timbre analysis between different instruments. It has many practical applications, such as music information retrieval, audio content analysis and automatic music transcription.
In the development of musical instrument recognition, two promising and challenging directions have emerged recently. One is applying musical instrument recognition to real-life situations. Instead of isolated notes [1,2], more recent research deals with solos and multi-instrument music [3–5]. The other is restricting the instrument models to specific types and making a refined classification. Much research has been done to distinguish similar instruments within a family. For example, Banerjee et al. examined features and classification strategies for identifying four instruments of the string family [6]. Fragoulis et al. explored the discrimination between piano notes and guitar notes [7]. Avci et al. studied the machine learning-based classification of violin and viola sounds playing the same notes [8]. Lukasik investigated the identification of individual instruments of the violin family [9].
The aim of this study is to tackle the problem of recognizing individual instruments of the same type from solo phrases. Owing to their identical structure and shape, these individuals have similar timbres that are hard to distinguish even for experienced musicians. With individual recognition, a target instrument can be identified among many instruments of identical appearance. For example, the solos played on a world-famous instrument can be retrieved from a music database automatically. This can also help people search for lost instruments through audio. Furthermore, the similarity of timbre between individuals can be derived from the recognition results, which is an essential basis for classifying the quality grade of instruments.
We choose violins for the individual recognition research. The violin first appeared in the 16th century, and it is an important instrument in a wide variety of musical genres. Many researchers have sought explanations for the differences between violins by investigating varnish and wood properties, plate tuning systems and the spectral balance of the radiated sound [10]. Violin performers have also been asked to rate violins for playability, articulation and projection [11]. However, neither acoustic measurements nor players can provide an absolutely reliable standard for identifying and evaluating violins. For example, even experienced players cannot distinguish a Stradivari from other violins under double-blind conditions [12]. The individual recognition of violins in our research may help address these issues.
In this study, the timbre analysis of the violin is investigated based on the source-filter model. A new set of features is proposed, which are extracted from the tonal content and the nontonal content. Inspired by speaker verification [13], a system based on the Gaussian mixture models-universal background model (GMM-UBM) is built to recognize different violins. A solo violin dataset is created in order to evaluate the recognition task of violin individuals. In addition to individual violin recognition, experiments on the grade classification of violins are conducted and the players' influence on individual recognition is discussed as well.
The remainder of this study is organized as follows: Section 2 reviews the related works. Section 3 describes the source-filter model of the violin. Based on the model, the tonal and nontonal features are extracted in Section 4. Section 5 introduces the GMM-UBM recognition system. In Section 6, the solo dataset created for violin recognition is described. The experimental results and performance analysis are presented in Section 7. Finally, we draw the conclusions in Section 8.

Related Works
The traditional approach of instrument recognition is to extract acoustic features from the music signal and classify them using pattern recognition algorithms [14]. The existing feature sets include perception-based, temporal, spectral and timbral features [15]. For example, Essid et al. dealt with real music using autocorrelation coefficients, amplitude modulation features, mel-frequency cepstral coefficients (MFCC), spectral centroid, spectral slope, frequency derivative of the constant-Q coefficients, etc. [16]. In [17], the performance of 13 features was compared, including MFCCs, MPEG-7 features [18] and perception-based features. Among the individual feature schemes, the MFCC feature scheme gave the best classification performance. Experiments in [19,20] also favored the MFCCs over other features. Inspired by the MFCC, Duan et al. proposed the mel-scale uniform discrete cepstrum as the feature, which can model the timbre of mixture music [21]. For the audio signals with multiple sources, Costa et al. presented a sparse time-frequency feature by combining different instances of the fan-chirp transform [22].
In addition to the feature extraction module, there has been considerable interest in the classification module. For example, Diment et al. [23] trained a Gaussian mixture model (GMM) on the combination of the MFCCs and phase-related features. As the GMM cannot model the dynamic evolution of the features, Zlatintsi et al. explored the hidden Markov model (HMM) for the classification task [24]. In [25], a pairwise strategy was used in the classification task and the SVM was shown to outperform the GMM. Joder et al. conducted a large number of experiments to assess the impact of temporal integration on instrument recognition systems, with the SVM, GMM and HMM as classifiers [26]. Recently, Yip and Bittner released an open-source solo classifier that uses random forests for better performance [27].
More recently, deep learning techniques have been increasingly used owing to their superior performance. Han et al. used a convolutional neural network (CNN) to recognize predominant instrument in mixture music, which achieved higher accuracy than the conventional methods using SVMs [4]. In [28], Hilbert spectral feature and MFCCs were employed as inputs of a CNN to classify the predominant instrument at different time intervals. Koszewski et al. proposed a CNN-based automatic instrument tagging method, which provided promising recognition scores even for noisy recordings [29]. Hung et al. explored the CNNs with constant-Q coefficients and harmonic series feature for frame-level instrument recognition and pitch estimation [30,31]. Yu et al. also proposed to construct a CNN in a pattern of multitask learning, which used the auxiliary classification to assist the instrument classification [32]. Deep architectures can also implement feature extraction and classification in an end-to-end manner, which outperformed the traditional two-stage architectures in many tasks of recognition. For example, Li et al. showed that feeding raw audio waveforms to a CNN achieved 72% F-micro in discriminating instruments, whereas the MFCCs and random forest only achieved 64% [33].
The performance of instrument recognition approaches depends heavily on annotated data, especially for deep learning methods. Several datasets have been developed, such as MUMS [34], RWC [35], ParisTech [26], UIOWA [36], MedleyDB [37], IRMAS [38] and MusicNet [39]. Currently, annotated data for individual recognition are quite limited. In this study, we focus on individual violin recognition and create a dataset of violin individuals.

Source-Filter Model of Violin
The source-filter model, which originated in the study of speech production, has been used for decades in speech coding and synthesis. Similar to the speech signal, the music signal can be modeled as a combination of excitation and resonator [40]:
$$x(t) = p(t) * h(t), \qquad (1)$$
where t denotes time, x(t) stands for the music signal, p(t) the excitation function, h(t) the impulse response of the resonator, and * denotes convolution. Here "source" refers to the vibrating strings and "filter" represents the resonance structure of the rest of the violin. When a string vibrates, the bridge rocks and transmits the vibration to the resonance box. Each component is then amplified according to the resonance generated at that frequency. In the source-filter model, the excitation and the resonator together determine the unique timbre of each violin.
The harmonic sequence is a typical structure in the spectrum of the music signal. The pitch is what is perceived as the tone, and its value can be determined according to the positions of harmonics for the violin. If two violins are playing the same note, the positions of two harmonic sequences will be the same. Under this simplified assumption, it will therefore be the envelope that makes the two sounds different. Hence, the timbre feature is often obtained through extracting the shape of the spectral envelope.
According to the sinusoids plus noise model, the pitched sound can be modeled as a sum of harmonics and a noise residual [41]. The residual part is often ignored because of its low energy. In the "source" of the violin, the stable harmonics result from the main modes of vibration of the strings. The residual noise is generated by the sliding of the bow against the string and by other nonlinear behavior in the excitation. Therefore, an improved source-filter model based on the sinusoidal plus residual decomposition is proposed [42]. The music signal x(t) can be modeled by
$$x(t) = \left[\sum_{r=1}^{R} A_r(t)\cos\big(\theta_r(t)\big) + e(t)\right] * h(t), \qquad (2)$$
where $A_r(t)$ and $\theta_r(t)$ are the instantaneous amplitude and phase of the rth sinusoid in the excitation, respectively, R is the number of sinusoids, e(t) is the noise component in the excitation, and h(t) is the impulse response of the resonator. According to Equation (2), the music signal can be divided into two parts, the tonal component $x_T(t)$ and the nontonal component $x_N(t)$:
$$x_T(t) = \left[\sum_{r=1}^{R} A_r(t)\cos\big(\theta_r(t)\big)\right] * h(t), \qquad (3)$$
$$x_N(t) = e(t) * h(t). \qquad (4)$$
As shown in Equations (3) and (4), each part is produced by its own excitation, but the resonator is the same.
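The decomposition in Equations (2) to (4) can be illustrated with a small synthetic signal. The following is only a sketch under stated assumptions: the sampling rate, the fundamental, the 1/r harmonic amplitudes, the noise level and the single-resonance impulse response are all hypothetical values chosen for illustration, not parameters from this study.

```python
import numpy as np

fs = 8000                       # sampling rate in Hz (assumed)
t = np.arange(fs) / fs          # one second of samples
f0 = 330.0                      # fundamental frequency in Hz (assumed)
rng = np.random.default_rng(0)

# Tonal excitation: five harmonics with decaying amplitudes A_r = 1/r.
harmonics = sum((1.0 / r) * np.cos(2 * np.pi * r * f0 * t) for r in range(1, 6))
# Nontonal excitation e(t): low-level white noise standing in for bow noise.
noise = 0.05 * rng.standard_normal(t.size)

# Toy resonator h(t): one damped resonance near 2 kHz (purely illustrative).
n = np.arange(256) / fs
h = np.exp(-400.0 * n) * np.cos(2 * np.pi * 2000.0 * n)

x_tonal = np.convolve(harmonics, h)       # x_T(t): harmonics through the resonator
x_nontonal = np.convolve(noise, h)        # x_N(t): noise through the same resonator
x = np.convolve(harmonics + noise, h)     # x(t): full excitation through the resonator
```

Because convolution is linear, `x` equals `x_tonal + x_nontonal` up to rounding, which is the property that lets the two parts be analysed separately while sharing one resonator.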
Electronics 2020, 9, 950
The approximate process in the frequency domain is illustrated in Figure 1. Considering the significance of the harmonics, the spectral envelope extracted from the music signal x(t) is similar to the envelope of the tonal part in Figure 1d. As shown in Figure 1d, the shape of the spectral envelope is determined at the positions of the partials, which is equivalent to a sampled envelope. The noise e(t) is a stochastic signal, and the spectral envelope of the nontonal component contains more comprehensive information about the resonator. In Figure 1, the envelope shape of the nontonal part differs considerably from that of the tonal part, so it captures complementary timbre information.

Tonal/Nontonal Content Extraction
The tonal part has been widely used in musical instrument classification by identifying the harmonic series from the spectrum [17,20,43]. In contrast, research on nontonal content is quite limited [44,45]. The nontonal part refers to the non-harmonic residual, which is often used to describe the frequency components located between the partials [8].
We extract the tonal and nontonal contents in the frequency domain. The multiple pitch estimation algorithm based on the harmonic product spectrum [46] is used to obtain the fundamental frequency of the music signal. For the fundamental frequency f0, the nth harmonic lies around nf0. Around each of these harmonic peaks, a region is defined to cover the peak. Here the region width of each peak is set to 0.3f0, so the nth harmonic peak interval is [(n − 0.15)f0, (n + 0.15)f0]. Figure 2 shows an isolated tonal lobe of a violin note whose fundamental frequency is about 329 Hz. By retaining the spectral regions in each harmonic peak interval, all the tonal lobes are obtained.
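The idea behind a harmonic product spectrum can be sketched as follows. This is a simplified single-pitch version written for illustration, not the multiple pitch estimation algorithm of [46]; the function name `hps_pitch`, the number of harmonics, the lower search bound `fmin` and the synthetic test note are all assumptions.

```python
import numpy as np

def hps_pitch(frame, fs, n_harmonics=4, fmin=100.0):
    """Estimate one fundamental frequency with a harmonic product spectrum."""
    windowed = frame * np.hanning(frame.size)
    mag = np.abs(np.fft.rfft(windowed))
    n_bins = mag.size // n_harmonics          # highest bin whose harmonics stay in range
    hps = np.ones(n_bins)
    for h in range(1, n_harmonics + 1):
        hps *= mag[::h][:n_bins]              # spectrum compressed by the factor h
    kmin = int(np.ceil(fmin * frame.size / fs))   # skip DC and very low bins
    k = kmin + np.argmax(hps[kmin:])
    return k * fs / frame.size

# Synthetic note with five harmonics of 330 Hz (hypothetical test signal).
fs = 8192
t = np.arange(fs) / fs
note = sum((1.0 / r) * np.cos(2 * np.pi * r * 330.0 * t) for r in range(1, 6))
f0_est = hps_pitch(note, fs)
```

The product is large only at a bin whose integer multiples all coincide with spectral peaks, which is why the maximum falls at the fundamental rather than at a single strong partial.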
The nontonal content can be extracted by eliminating the tonal lobes from the spectrum. First, all the amplitude values in each harmonic peak interval are set to zero. Then the zero-magnitude points in the spectrum are filled in by spline interpolation. After this operation has been applied to the whole spectrum, a curve of the nontonal content is obtained.
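The two extraction steps described above can be sketched for a single magnitude spectrum with a known f0. This is a minimal illustration, not the authors' implementation: the function name and arguments are hypothetical, and linear interpolation (`numpy.interp`) is swapped in for the spline interpolation used in the text to keep the example dependency-free.

```python
import numpy as np

def split_tonal_nontonal(mag, freqs, f0, rel_width=0.3):
    """Split a magnitude spectrum into tonal lobes and a nontonal curve."""
    # Index n of the harmonic nearest to every bin (at least the first harmonic).
    n = np.maximum(np.round(freqs / f0), 1.0)
    # A bin is tonal if it lies in [(n - 0.15) f0, (n + 0.15) f0].
    in_lobe = np.abs(freqs - n * f0) <= 0.5 * rel_width * f0

    tonal = np.where(in_lobe, mag, 0.0)       # keep only the harmonic lobes
    # Fill the removed lobes from the surrounding bins. The text uses spline
    # interpolation; linear interpolation is an assumption made for brevity.
    nontonal = mag.astype(float).copy()
    nontonal[in_lobe] = np.interp(freqs[in_lobe], freqs[~in_lobe], mag[~in_lobe])
    return tonal, nontonal, in_lobe
```

With f0 = 330 Hz, for example, the first lobe covers 280.5 Hz to 379.5 Hz, so bins in that range go to the tonal part and are bridged over in the nontonal curve.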

Figure 3 shows the original spectrum of a violin note E4 and its envelopes. The fundamental frequency of note E4 is about 329 Hz, and the corresponding harmonic structure is obvious in Figure 3. The envelope of the tonal content is similar to that of the original spectrum, whereas the envelope of the nontonal content differs from it. Compared with the envelopes of the original spectrum and the tonal content, several new peaks appear in the nontonal content, such as an obvious one located at 2000 Hz. Therefore, it is feasible to extract complementary timbre information from the tonal content and the nontonal content.

Feature Extraction
Based on the source-filter model, the cepstral coefficients are usually used to parameterize the spectral envelope. The MFCC emphasizes perceptually meaningful frequencies using the Mel scale and provides a more compact representation than the plain cepstral coefficients.
Similar to the MFCC, the gammatone frequency cepstral coefficient (GFCC) is derived using gammatone filters with equivalent rectangular bandwidth (ERB) bands [47]. The gammatone filterbank models human cochlear filtering better than the Mel filterbank. In speech recognition [48], speaker identification [49] and music retrieval [50], the GFCC has been reported to perform substantially better than the conventional MFCC. Several studies also indicate that the GFCC is more robust to noise than the MFCC [51,52].
The impulse response of a gammatone filter centered at frequency f is given by
$$g(f, t) = t^{a-1} e^{-2\pi b t} \cos(2\pi f t), \quad t \geq 0,$$
where t denotes time, a is the filter order, and the rectangular bandwidth b increases with the center frequency f. With the designed gammatone filterbank, a time-frequency representation can be obtained from the outputs of the filterbank. The GFCC is then obtained by taking the logarithm of the power spectrum and applying the discrete cosine transform (DCT) to the log spectrum.
In this study, two timbre features named tonal GFCC and nontonal GFCC are introduced. These two features are extracted as shown in Figure 4. A tonal/nontonal content extraction block is inserted after the traditional FFT. Here the gammatone filterbank is modified by attenuating the "tails" of the response further away from the filter's center frequency.
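A single gammatone impulse response can be sketched as below. The ERB formula is the widely used Glasberg and Moore approximation and the factor 1.019 is the conventional bandwidth scaling; neither value is stated in this paper, so both are assumptions, as are the function names, the fourth-order default and the 50 ms duration.

```python
import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore approximation, assumed)."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_ir(f, fs, order=4, duration=0.05):
    """Sampled impulse response t^(a-1) exp(-2*pi*b*t) cos(2*pi*f*t) of one filter."""
    t = np.arange(int(duration * fs)) / fs
    b = 1.019 * erb(f)                          # bandwidth grows with the centre frequency
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f * t)
    return g / np.max(np.abs(g))                # peak-normalised for convenience

# Example: one channel centred at 1 kHz, sampled at 16 kHz (hypothetical values).
g = gammatone_ir(1000.0, 16000)
spectrum = np.abs(np.fft.rfft(g))
peak_hz = np.fft.rfftfreq(g.size, 1.0 / 16000)[np.argmax(spectrum)]
```

The magnitude response of this filter peaks near the centre frequency, and a bank of such filters spaced on the ERB scale gives the time-frequency representation from which the log and DCT steps produce the GFCC.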


Individual Violin Recognition System
The recognition system is built using a UBM for general violin representation and maximum a posteriori (MAP) adaptation to derive individual violin models from the UBM. The basic components of the GMM-UBM recognition system are shown in Figure 6.

As a front-end processing stage in the system, the music is segmented into frames by a short window of 93 ms and features are extracted from the frames. The UBM is a single GMM trained to represent the individual-independent distribution of features. A number of solos from different violins are pooled as the background music data, and a sequence of feature vectors is extracted for training the UBM. For the sequence of feature vectors $Y = \{y_1, y_2, \ldots, y_T\}$, the likelihood function of the UBM is defined as
$$p(Y \mid \lambda_{UBM}) = \prod_{t=1}^{T} p(y_t \mid \lambda_{UBM}),$$
where $p(y_t \mid \lambda_{UBM})$ is the weighted linear combination of M unimodal Gaussian densities:
$$p(y_t \mid \lambda_{UBM}) = \sum_{m=1}^{M} c_m^{UBM}\, \mathcal{N}\big(y_t;\ \mu_m^{UBM}, \Sigma_m^{UBM}\big).$$
In the density, $c_m^{UBM}$, $\mu_m^{UBM}$ and $\Sigma_m^{UBM}$ are the weight, mean vector and covariance matrix of the mth Gaussian component, respectively [53].
Then a hypothesized violin model is trained using training data of the specified individual. Instead of training a model for the specified violin directly, our basic idea is to derive the violin's model by updating the parameters of the UBM via MAP adaptation. In this case, the components of the adapted GMM retain their correspondence with the mixtures of the UBM.
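Given the training feature vectors $Z = \{z_1, z_2, \ldots, z_T\}$ from the specified individual, the details of the adaptation are as follows. Similar to the EM algorithm, we first compute the posterior probability for the mixture m in the UBM:
$$P(m \mid z_t) = \frac{c_m^{UBM}\, \mathcal{N}\big(z_t;\ \mu_m^{UBM}, \Sigma_m^{UBM}\big)}{\sum_{l=1}^{M} c_l^{UBM}\, \mathcal{N}\big(z_t;\ \mu_l^{UBM}, \Sigma_l^{UBM}\big)},$$
where l is a latent variable that represents the index of the Gaussian components. Then the sufficient statistics for the weight, mean and variance of the mth Gaussian component are computed from the feature vectors $z_t$ as follows:
$$n_m = \sum_{t=1}^{T} P(m \mid z_t),$$
$$E_m(z) = \frac{1}{n_m} \sum_{t=1}^{T} P(m \mid z_t)\, z_t,$$
$$E_m(z z^{\top}) = \frac{1}{n_m} \sum_{t=1}^{T} P(m \mid z_t)\, z_t z_t^{\top}.$$
Next, the hypothesized violin model is derived from the UBM. The parameters of the mth mixture are updated using these new sufficient statistics:
$$c_m^{vn} = \left[\alpha_m \frac{n_m}{T} + (1 - \alpha_m)\, c_m^{UBM}\right] \rho,$$
$$\mu_m^{vn} = \beta_m E_m(z) + (1 - \beta_m)\, \mu_m^{UBM},$$
$$\Sigma_m^{vn} = \gamma_m E_m(z z^{\top}) + (1 - \gamma_m)\left(\Sigma_m^{UBM} + \mu_m^{UBM} \mu_m^{UBM\top}\right) - \mu_m^{vn} \mu_m^{vn\top},$$
where $\alpha_m$, $\beta_m$ and $\gamma_m$ are adaptation coefficients for the weight, mean and variance, respectively. The coefficient $\rho$ is a scale factor computed over all adapted weights $c_m^{vn}$ to ensure that they sum to unity.
To simplify the adaptation, all the Gaussian components in the individual violin model share the same variance and weight parameters with the UBM. Only the mean $\mu_m^{vn}$ is updated here. Defining the mean adaptation coefficient $\beta_m$ as
$$\beta_m = \frac{n_m}{n_m + r},$$
where r is a relevance factor in the range of 8 to 20, the mean vectors of the hypothesized violin model can be updated as
$$\mu_m^{vn} = \frac{n_m}{n_m + r}\, E_m(z) + \frac{r}{n_m + r}\, \mu_m^{UBM}.$$
For the hypothesized violin model $\lambda_{vn} = \{c_m^{UBM}, \mu_m^{vn}, \Sigma_m^{UBM} \mid m = 1, 2, \ldots, M\}$, the log-likelihood ratio for the test sequence $W = \{w_1, w_2, \ldots, w_{T''}\}$ is computed as the test score:
$$\Lambda(W) = \log p(W \mid \lambda_{vn}) - \log p(W \mid \lambda_{UBM}).$$
Finally, a decision threshold is determined for accepting or rejecting the hypothesized violin model. The corresponding violin is selected as a candidate when its score exceeds the threshold. The scores of all candidate models are then sorted, and the model with the highest score is taken as the recognition result.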
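The mean-only adaptation and the scoring step can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' code: it assumes diagonal covariance matrices, the function names are hypothetical, and the score is averaged per frame, which is a common convention rather than something the text specifies.

```python
import numpy as np

def log_gauss_diag(X, means, variances):
    """Log N(x; mu_m, diag(var_m)) for every frame (rows) and component (columns)."""
    diff = X[:, None, :] - means[None, :, :]
    return -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                   + np.sum(np.log(2 * np.pi * variances), axis=1))

def gmm_loglik(X, weights, means, variances):
    """Per-frame log p(x | lambda) of a diagonal-covariance GMM."""
    log_joint = np.log(weights) + log_gauss_diag(X, means, variances)
    top = log_joint.max(axis=1, keepdims=True)           # log-sum-exp for stability
    return (top + np.log(np.exp(log_joint - top).sum(axis=1, keepdims=True))).ravel()

def map_adapt_means(Z, weights, means, variances, r=16.0):
    """Mean-only MAP adaptation of the UBM to the training frames Z."""
    log_joint = np.log(weights) + log_gauss_diag(Z, means, variances)
    post = np.exp(log_joint - gmm_loglik(Z, weights, means, variances)[:, None])
    n = post.sum(axis=0)                                  # soft counts n_m
    ez = post.T @ Z / np.maximum(n, 1e-10)[:, None]       # first-order statistics E_m(z)
    beta = (n / (n + r))[:, None]                         # beta_m = n_m / (n_m + r)
    return beta * ez + (1.0 - beta) * means

def llr_score(W, weights, ubm_means, violin_means, variances):
    """Per-frame average log-likelihood ratio between the adapted model and the UBM."""
    return np.mean(gmm_loglik(W, weights, violin_means, variances)
                   - gmm_loglik(W, weights, ubm_means, variances))
```

Components that receive little training data keep a small `beta` and stay close to the UBM means, while well-observed components move toward the statistics of the specific violin. This is the mechanism that makes the adapted model usable even with a single training solo.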

Violin Dataset for Individual Recognition
A solo violin dataset is developed in order to evaluate the individual violin recognition system. The dataset consists of solo recordings from 86 violins with various characteristics. According to the raw material, production process and tonal quality, the violins are divided into low-grade, medium-grade and high-grade instruments. The dataset includes violins of all three grades, whose prices range from $100 to $20,000. More details of the violin dataset are given in Table 1. A total of 68 violin performers participated in the recording. The performers have 3 to 55 years of experience in playing the violin. Each performer played 3 or 4 excerpts of different content using the same violin. The music excerpts cover classical music, popular music and Chinese folk music. For each violin, the solos of the same content were performed by different players. For example, there are 60 medium-grade violins and each violin was played by 3 performers. In this case, each player performed 4 music excerpts, including Turkish March, Humoreske, Can You Feel My Love and Early Spring. It should be noted that the number of performers for high-grade violins is quite limited, because each valuable violin was only played by its owner.
All the solo excerpts were played indoors in a quiet environment. The audio files were recorded at a 44.1 kHz sampling frequency with a resolution of 32 bits per sample. After the recording sessions, all the music data were reviewed and processed to remove silence at the beginning and end of the recordings. Each solo excerpt lasts about 3 min, and the 1070 files contain more than 50 h of audio signals.

Individual Violin Recognition
In order to obtain a general violin representation independent of the dataset, 140 solos of unknown violins were downloaded from the Internet as the background data. All files in the violin dataset were divided into two groups, referred to as the training data and the test data. Three ways of partitioning the data are shown in Table 2. For each violin, at least one solo file was randomly chosen for the training set. In this way, 86 violin models were derived. The remaining solo excerpts for each individual were collected as the test set. All the music data were segmented into frames by a 93 ms window advancing with a 46.5 ms frame shift. The 40-dimensional tonal GFCC and the 40-dimensional nontonal GFCC were concatenated as the combined features, which were fed into the recognition system. In the GMM-UBM system, models with 64 Gaussian components were trained. The relevance factor r was fixed at 16 in the adaptation process.
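The framing and feature concatenation described above can be sketched as follows. The frame length of 4096 samples and the hop of 2048 samples are assumptions: at 44.1 kHz they correspond to about 92.9 ms and 46.4 ms, which is close to the stated 93 ms and 46.5 ms, but the exact sample counts are not given in the text. The function names are hypothetical.

```python
import numpy as np

def frame_signal(x, frame_len=4096, hop=2048):
    """Slice a 1-D signal (at least frame_len samples long) into overlapping frames."""
    n_frames = 1 + (x.size - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]                                  # shape: (n_frames, frame_len)

def combine_features(tonal_gfcc, nontonal_gfcc):
    """Concatenate per-frame tonal and nontonal GFCC vectors along the feature axis."""
    return np.concatenate([tonal_gfcc, nontonal_gfcc], axis=1)

# One second of audio at 44.1 kHz yields 20 half-overlapping frames.
frames = frame_signal(np.arange(44100.0))
# Placeholder 40-dimensional features stand in for the real GFCC vectors.
combined = combine_features(np.zeros((20, 40)), np.zeros((20, 40)))
```

Each row of `combined` is an 80-dimensional vector, matching the concatenation of the two 40-dimensional GFCC sets.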
We also carried out experiments using other features and classifiers. The features for comparison include the MPEG-7 descriptors, linear prediction cepstral coefficients (LPCC) and MFCC, which are widely used in instrument classification. Specifically, the MPEG-7 set includes seven feature descriptors: harmonic centroid, harmonic deviation, harmonic spread, harmonic variation, spectral centroid, log attack time and temporal centroid.
The classifiers used for comparison are the GMM and the CNN. For each violin, a GMM of 64 Gaussian components was trained via the EM algorithm. The CNN model was based on Inception-v3 [54]. Its architecture and parameters are shown in Appendix A. The CNN took 15-frame features as the input. For an n-dimensional feature, the dimension of the input was n × 15 × 1. The CNN had 86 units in the output layer, corresponding to the 86 violins. We also trained a CNN in an end-to-end manner, which employed the raw audio waveforms as input. This CNN had the same architecture and parameters as the sample-level network given in [55]. The raw waveform input was set to 59,049 samples (2678 ms at a 22.05 kHz sampling frequency), and the dimension of the output was 86.
Because the training sets were chosen randomly, all the experiments were repeated 10 times and the results were averaged. The accuracy of individual violin recognition is shown in Table 3. The accuracy is the fraction of correctly recognized solo excerpts among all the test excerpts, which is equal to the micro-averaged measure F-micro in this single-label multi-class problem. We also used the macro-averaged measure F-macro [56] to evaluate the recognition performance across the 86 classes. The F-macro of individual violin recognition is presented in Table 4. Among all the extracted features in Tables 3 and 4, the combined features perform best regardless of the type of classifier and the amount of training data. The performance of the GMM-UBM system is superior to that of the GMM system. The UBM can model the individual-independent distribution of the features using background data, so the GMM-UBM performs better. As the number of training samples increases, the difference between the performance of the GMM and the GMM-UBM becomes smaller. When the training data are more abundant, the superiority of the UBM is less obvious. In "Train1" and "Train2", the performance of the CNN system is inferior to that of the two GMM-based systems. This is largely because the small amount of training data limits the performance of the CNN. In "Train3", the performance of the CNN system is still unsatisfactory. One of the reasons is that the training data of "Train3" are class-imbalanced. For example, there are only 3 solos for each high-grade violin in the dataset, so each high-grade violin has less training data than the other violins. We can also observe that the performance of the end-to-end CNN system is similar to or even better than that of the CNN with the combined features. This indicates that the CNN can learn some effective representations from the raw audio waveforms.
On the metrics of accuracy and F-macro, we further performed a paired t-test to compare the performance of the proposed features and the MFCC. The p-values are presented in Table 5. Most of the p-values are smaller than 0.05, which indicates that the superiority of the combined features over the MFCC is statistically significant.

Violin Grade Classification
Using the grade information instead of the individual label, the classification of violin grades can be implemented similarly. The system's output was one of three grades rather than one of 86 individuals. With the same partitioning of the training and test sets given in Table 2, we carried out the classification experiments. The accuracy and F-macro of violin grade classification are presented in Tables 6 and 7, respectively. According to the results in Tables 6 and 7, we can draw a conclusion similar to that for individual violin recognition. For each classifier, the combined features lead to better performance than the other features. The superiority of the GMM-UBM over the GMM and the CNN is still obvious. Compared with the results shown in Tables 3 and 4, all the systems perform better in violin grade classification than in individual violin recognition. The grade of a violin is a broader category than the individual, which accounts for the better performance. We can also observe a large difference between the values of accuracy and F-macro in Tables 6 and 7. The accuracy aggregates the contributions of all test solo excerpts to compute a single metric, whereas the F-macro computes the metric independently for each class and then takes the average. When the classification results of the three grades differ considerably, there is an obvious difference between the overall accuracy and the F-macro.
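The gap between accuracy and F-macro under class imbalance can be reproduced with a few lines of plain Python. The labels below are hypothetical toy data, not results from this study, and the function name is an assumption; the example only shows how a classifier that favours the majority grade scores well on accuracy but poorly on F-macro.

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-averaged F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)   # per-class F1
    return correct / len(y_true), sum(f1_scores) / len(labels)

# Hypothetical test set: 8 medium-grade, 1 low-grade and 1 high-grade excerpt,
# with every excerpt predicted as medium grade.
y_true = ["medium"] * 8 + ["low", "high"]
y_pred = ["medium"] * 10
acc, f_macro = accuracy_and_macro_f1(y_true, y_pred)
```

Here the accuracy is 0.8, while the per-class F1 values are 16/18 for the medium grade and 0 for the other two grades, so the F-macro is only about 0.30.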
In the violin dataset, 60 violins belong to the medium grade, and the solos in this grade account for 67% of the total data. For this majority grade, the accuracy of violin grade classification is presented in Table 8. Compared with the results shown in Table 6, all the systems obtain higher accuracy on the medium-grade data. This is largely because the training data for the medium grade are more plentiful than those for the other grades. In "Train3", the CNN performs better than the two GMM-based systems on the classification of medium-grade violins. We believe that the CNN can be a promising model when the training data are sufficient and balanced. A paired t-test was also conducted to compare the performance of the classification systems using the combined features and MFCC. As shown in Table 9, the p-values indicate that the superiority of the combined features over MFCC is statistically significant.

Effect of Performer
The effect of the violin performer was also examined in the individual violin recognition systems using the combined features. According to the label information of the player, two sets of experiments were conducted. In the same-performer scheme, the training data and test data for each violin were performed by the same player. In the different-performer scheme, the training data and test data for each violin came from different players. The corresponding ways of data partitioning are shown in Table 10. The recognition accuracy rates (%) of the two schemes are presented in Table 11. There is a large gap between the results of the same-performer scheme and those of the different-performer scheme. The accuracy rate improves markedly when all the training data and test data are played by the same person. This indicates that the classifiers are overfitted to some extent: they fit the limited training set too closely, so the same-performer scheme obtains better performance. Comparing the results of the two performer-based schemes, the difference for the GMM-UBM system is smaller than that for the GMM system. This is because the UBM can model the performer-independent distribution of the features using background data, which mitigates the overfitting of the GMM. For each violin in the dataset, all solos played by a specific performer have different contents. The high accuracy of the same-performer scheme therefore also shows that the influence of the performer is more prominent than that of the solo content in the recognition experiments.
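The two partitioning schemes can be sketched as follows. This is an illustration under assumptions and does not reproduce the partitioning of Table 10: the function name `split_by_performer`, the tuple layout, the choice of the alphabetically first performer as the training performer, and the parameter `n_train` are all hypothetical.

```python
from collections import defaultdict

def split_by_performer(excerpts, scheme, n_train=1):
    """Split (violin, performer, excerpt) tuples into train and test lists.

    scheme="same": test excerpts come from the training performer.
    scheme="different": test excerpts come from all other performers.
    """
    by_violin = defaultdict(lambda: defaultdict(list))
    for violin, performer, excerpt in excerpts:
        by_violin[violin][performer].append(excerpt)

    train, test = [], []
    for violin, by_perf in by_violin.items():
        performers = sorted(by_perf)
        train_perf = performers[0]  # arbitrary choice for this sketch
        train += [(violin, train_perf, e) for e in by_perf[train_perf][:n_train]]
        if scheme == "same":
            test += [(violin, train_perf, e) for e in by_perf[train_perf][n_train:]]
        elif scheme == "different":
            test += [(violin, p, e) for p in performers[1:] for e in by_perf[p]]
        else:
            raise ValueError("scheme must be 'same' or 'different'")
    return train, test

# Hypothetical excerpts: two violins, each played by performers A and B.
excerpts = [
    ("v1", "A", "s1"), ("v1", "A", "s2"), ("v1", "B", "s3"),
    ("v2", "A", "s4"), ("v2", "A", "s5"), ("v2", "B", "s6"),
]
train_same, test_same = split_by_performer(excerpts, "same")
train_diff, test_diff = split_by_performer(excerpts, "different")
```

In both schemes the training and test excerpts are disjoint, so any difference in accuracy between the schemes reflects the performer rather than a repeated excerpt.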

Conclusions
This paper proposed combined features consisting of tonal GFCC and nontonal GFCC, which can capture more comprehensive information about timbre. Using the proposed features, a framework based on the GMM-UBM was built to identify different violins. The framework employed a UBM to represent the violin-independent distribution of the combined features and MAP adaptation to derive the individual violin models.
In order to evaluate the performance of individual violin recognition, a solo dataset consisting of 86 violins was developed. Among all the extracted features in this paper, the proposed features performed best in both individual violin recognition and violin grade classification. The superiority of the GMM-UBM over the CNN was more obvious when the training set was smaller. The success of deep learning often hinges on the availability of sufficient training data, and the small amount of training data limited the performance of the CNN. The UBM could model the individual-independent distribution of the features using background data, so the GMM-UBM performed better with limited training data. Considering the great influence of players, performer-independent individual recognition and a more robust model are promising directions for future work.

Figure A1. Diagram of the Inception network.