Enhancement of Conventional Beat Tracking System Using Teager–Kaiser Energy Operator

Abstract: Beat detection systems are widely used in the music information retrieval (MIR) research field for the computation of tempo and beat time positions in audio signals. One of the most important parts of these systems is usually onset detection. There is an understandable tendency to employ the most accurate onset detector. However, there are options to increase the global tempo (GT) accuracy and also the detection accuracy of beat positions at the expense of less accurate onset detection. The aim of this study is to introduce an enhancement of a conventional beat detector. The enhancement is based on the Teager–Kaiser energy operator (TKEO), which pre-processes the input audio signal before the spectral flux calculation. The proposed approach is first evaluated in terms of its ability to estimate the GT and the beat positions of given audio tracks, compared to the same conventional system without the proposed enhancement. The accuracy of the GT and average beat differences (ABD) estimation is tested on a manually labelled reference database. Finally, this system is used for analysis of a string quartet music database. Results suggest that the presence of the TKEO lowers onset detection accuracy but improves the GT and ABD estimation. The average deviation from the reference GT in the reference database is 9.99 BPM (11.28%), which improves on the conventional methodology, where the average deviation is 18.19 BPM (17.74%). This study has a pilot character and provides some suggestions for improving the beat tracking system for music analysis.


Introduction
Onset time in audio signal analysis represents the time position of a relevant sound event, usually the moment when a musical tone is created. Onset detection functions are algorithms that capture onsets (onset time positions), and thus ideally all tones in audio recordings. They can create a representation of how the onset structure of a particular audio recording evolves in time. There are also offsets of tones (indicating the end time position of a tone in a signal), e.g., see [1,2], but beat tracking systems do not need such information to work properly. The conventional beat tracking system is usually based on the calculation of repetitiveness of the dominant components in an onset function (onset curve), and its output represents a temporal framework, i.e., the time instances where a person would tap when listening to the corresponding piece of music. That is why it is important to have a robust and computationally effective onset detector. Calculation of the beat positions and global tempo (GT) is important for musicologists and for complex music analysis. With such automated systems, tempo and agogic changes can be measured much faster than with a manual approach alone, and the closer the calculated beat positions are to the true ones, the less time musicologists have to spend correcting them. Therefore, we introduce a new parameter, the average beat deviation (ABD), defined as the average deviation of the calculated beat positions from the reference beat positions.
Most onset detectors are based on energy changes in spectra, i.e., on the calculation of spectral flux. For bowed string instruments there is a method called SuperFlux that can suppress vibrato in an expressive performance and reduce the number of false-positive detections [3]. Some methods enhance the spectral flux onset detection using logarithmic spectral compression and then compute the cyclic tempogram for a tempo analysis [4]. There is also a method that calculates tempograms using Predominant Local Pulse [5]. Besides, onset detection and beat detection can be performed with several toolboxes and libraries such as Tempogram Toolbox [6], LibROSA [7], MIR Toolbox [8], etc. [9]. The state-of-the-art onset detectors are usually based on deep neural networks [10,11] using spectral components and parameters as their inputs. Beat detection systems benefit from solid onset detectors, in whose output periodicity is identified [6,8,12–14]. As in other MIR fields, neural networks are also used.
While onset detection in percussive music is considered to be highly accurate (already at MIREX 2012 conference [15], algorithms achieved F-measure values greater than 0.95 for percussive sounds), detection of soft onsets produced by bowed string or woodwind instruments is still challenging. Although a lot of improvements in onset detection have been made, no system is truly universal for all musical instruments and all types of music.
This work aims to enhance the conventional beat tracking system and to improve the tempo analysis methodology published in [16,17] using a more sophisticated approach to tempo structure creation based on an automated beat tracking system with the Teager-Kaiser energy operator (TKEO) included. This nonlinear energy operator is used, e.g., for the improvement of onset detection in EMG signals (electromyography) [18], to decompose audio into amplitude and frequency modulation components [19], for the detection of Voice Onset Time [20], or as a highly efficient technique for LOS estimation in WCDMA mobile positioning [21]. So far there is no extensive study on the use of the TKEO for the analysis of musical instruments.
Since we will focus on the detection of onsets of melody instruments with low-energy attacks, we will concentrate on the onset and beat detection method based on spectral changes. We have not chosen probabilistic models, because they are usually susceptible to noisy recordings, which can be a problem in the case of old recordings.
The rest of the paper is organised as follows: Section 2 describes the onset detection function, the Teager-Kaiser energy operator, the proposed enhancement of the conventional beat tracking system and the beat detection method. It shows how the TKEO changes the spectra and therefore the output of the onset detection. Then, it introduces the reference database and the string quartet database used for the GT and ABD estimation. Furthermore, a possible application is shown and the system evaluation is defined. Results are reported in Section 3 and discussed in Section 4. Finally, conclusions are given in Section 5.

Onset Detection
Usually, onset detection algorithms use some pre-processing steps to reduce redundant information and to improve detection accuracy. In this study, we propose a new method of pre-processing based on the TKEO. The TKEO (Ψ{s(t)}) is a nonlinear energy operator that can be calculated using the following formula:

$$\Psi\{s(t)\} = \left(\frac{\mathrm{d}s(t)}{\mathrm{d}t}\right)^{2} - s(t)\,\frac{\mathrm{d}^{2}s(t)}{\mathrm{d}t^{2}},$$

i.e., we compute the square of the first derivative (which denotes the square of the rate of signal change) and then subtract the signal multiplied by the second derivative (which determines the acceleration at that point). Because the operator works with the rate of change, slow changes of the signal magnitude are suppressed and fast temporal changes are emphasised. It is known that the faster a signal changes in time, the higher the frequency components that appear in its spectrum. By taking the first derivative into account, we increase the magnitude of the higher frequencies of the spectrum [22]. In our discrete approach, we first downsample the input signal x[n] to 22,050 Hz. Next, we apply the TKEO, i.e., we calculate the corresponding discrete non-causal form:

$$\Psi\{x[n]\} = x^{2}[n] - x[n-1]\,x[n+1],$$

which creates an energy profile of the given audio sample. In comparison to the conventional squared energy operator, the TKEO also takes the signal's frequency into account [23] and it can have negative values, e.g., see Figure 1. Differences in spectra for the same audio track (clarinet recording) are shown in Figure 2. It is interesting how the dominant spectral components have changed: the clarinet has naturally strong odd harmonics, but the TKEO has changed their magnitude.

In the following step, we calculate the onset envelope using the perceptual model. We use the Short-Time Fourier Transform (STFT) with a Hann window (hop size: 512 samples) and then convert the result to the perceptual model with a log-power mel-frequency representation: 120 mel bands, maximum frequency at 10 kHz and minimum frequency at 27.5 Hz. We get the matrix |X[m, k]|, where m denotes the index of the frame and k the frequency bin or the index of the mel band.
These settings were inspired by SuperFlux calculation [3].
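The discrete TKEO used in the pre-processing stage, ψ[n] = x[n]² − x[n−1]·x[n+1], can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in the study: the function name `tkeo` and the handling of the two edge samples (copying the nearest interior value) are assumptions. For a pure tone A·cos(ωn + φ) the operator returns the constant A²·sin²(ω), which makes its dependence on frequency explicit.

```python
import numpy as np


def tkeo(x):
    """Discrete non-causal TKEO: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    x = np.asarray(x, dtype=float)
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    # Edge samples lack a two-sided neighbourhood; copying the nearest
    # interior value is an assumption of this sketch.
    if x.size > 2:
        psi[0], psi[-1] = psi[1], psi[-2]
    return psi


# Pure tones at the 22,050 Hz rate used after downsampling.
sr = 22050
n = np.arange(1000)
amp = 0.5
omega_440 = 2 * np.pi * 440 / sr
omega_880 = 2 * np.pi * 880 / sr
psi_440 = tkeo(amp * np.cos(omega_440 * n + 0.3))
psi_880 = tkeo(amp * np.cos(omega_880 * n + 0.3))
print(psi_440.mean(), psi_880.mean())
```

The same amplitude at a higher frequency gives a larger output, which is consistent with the emphasis of higher frequency components described in the text.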
In the next step, we calculated the spectral flux. The basic version of spectral flux is defined as the $\ell_1$-norm of the half-wave rectified difference between consecutive frames [24]:

$$\mathrm{SF}[m] = \sum_{k=0}^{K-1} H\big(|X[m+1,k]| - |X[m,k]|\big)$$

for m = 0, 1, 2, . . . , M − 2, where H(x) = (x + |x|)/2 is the half-wave rectifier, M is the number of frames, and K is half the number of STFT frequency bins, or the number of mel bands. The half-wave rectifier sets negative differences to zero, and the positive differences are summed across all frequency bands. Spectral flux therefore tells us how the energy in the spectra changes in time. Finally, a peak-picking function is applied (default LibROSA settings) to identify the time positions of onsets and therefore of new tones in the audio signal. An example of this system based on the mel-frequency representation, but without the use of the TKEO, is shown in Figure 3. It represents a solo clarinet part. The onset function detected many false peaks and marked positions where no tones were played. For comparison, Figure 4 shows the same signal, but in this case pre-processed by the TKEO. The peak-picking function now marked all real onsets with better accuracy and without any false-positive detection. The colorbar in dB (Figure 5) is presented separately to keep the spectrogram and the onset function properly aligned, but it is the same for all spectrograms in this paper (produced by the matplotlib package).

As we can see in the second spectrogram (Figure 4), the energy in the spectra changed, the frequencies do not correspond properly to the original signal, and new tones are sharpened and much clearer. We give this example for a good reason: the recording of a solo clarinet was the only audio track in which the accuracy of the onset detection function was improved. Adding the TKEO to this conventional detection method lowered the general detection accuracy. It decreased the number of detected false positives but also decreased the number of true positives. The cause of this phenomenon is explained in the following Section 2.2.
We suggest that the general effect of the TKEO on onset detection function for woodwind instruments should be tested in more detail.
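The half-wave rectified spectral flux described in this section, SF[m] = Σ_k H(|X[m+1, k]| − |X[m, k]|) with H(x) = (x + |x|)/2, can be sketched as follows. The function name `spectral_flux` and the toy magnitude matrix are illustrative assumptions. The input is laid out as frames × bands to match the |X[m, k]| notation of the text, whereas LibROSA stores spectrograms as bands × frames, so a transpose would be needed there.

```python
import numpy as np


def spectral_flux(mag):
    """Half-wave rectified spectral flux of a (frames x bands) magnitude matrix.

    Returns SF[m] for m = 0 .. M-2, comparing frame m+1 with frame m.
    """
    mag = np.asarray(mag, dtype=float)
    diff = mag[1:] - mag[:-1]                 # change between consecutive frames
    rectified = (diff + np.abs(diff)) / 2.0   # H(x): negative changes become zero
    return rectified.sum(axis=1)              # sum over frequency bins / mel bands


# Toy example with M = 4 frames and K = 2 bands (made-up values).
toy = [[0.0, 0.0],
       [1.0, 2.0],   # energy rises in both bands
       [1.0, 0.0],   # energy only decays
       [3.0, 0.0]]   # energy rises in the first band
print(spectral_flux(toy))
```

Only energy increases contribute, so the decaying frame produces zero flux.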

TKEO Influence
We applied the proposed method with the TKEO included to more recordings and observed that in cases where the tones are fast (e.g., a violin playing thirty-second notes), or where the energy difference is very low, this method does not detect every onset properly. Adding the TKEO increased the detection tolerance of fast changes in the signal. This means that the operator added additional "latency" to the signal values. It also decreased the ability of this system to capture low-energy spectral components. In general, fewer onsets were detected: only strong and rhythmically more important onsets remained. This is the advantage of the TKEO in the system. It suppresses less dominant spectral components and very fast tones, even though onset detectors are usually set to do the opposite. Figures 6 and 7 show another analysed track, a violin solo in a very fast tempo. There is a clear difference between the spectrograms for the described detector and for the same detector with the TKEO included. Most of the tones are clearly visible in the spectrogram in the first figure. However, in the system with the TKEO the changes in the spectrum are vaguer and more blurred, so the onset function detected fewer onsets (especially between the 1st and the 4th second of this track). In this case, the conventional system detected more onsets correctly, but that still does not indicate that its estimation of the GT would also be more accurate.

Tempo Representation
To create a tempo structure of given recordings, we need a representation of tempo, i.e., of how the density of onsets, or more precisely the repetitiveness of significant onsets, is distributed. This can be done by several techniques; in this study we focused on the dynamic beat tracking system proposed in [12]. This system estimates beat positions in an onset envelope and uses them to pick the right peaks within a given interval (default tempo). The default tempo is set before the calculation (or it is calculated automatically based on the autocorrelation function with respect to the standard 120 BPM), and therefore, for the system to work as intended, it has to be estimated by listening to the particular audio track or from the sheet music. The calculated peak positions can deviate from the default tempo within adjustable boundaries (depending on the settings, e.g., Ellis reports approximately 10% [12]). The parameter "tightness", which corresponds to the detection tolerance (from the default tempo), was set to 50 in all cases. At first, this looks like an inappropriate method for the varying tempo of string quartet music (second database), but with good parameterization and segmentation of particular motifs, it fits our needs.
Beat detectors are based on a calculation of beats in an audio signal and therefore capture the metric structure from an elementary point of view. Usually, there is not enough information to consider dividing beats into bars without manual correction, but with proper segmentation, a MIDI reference and dynamic time warping (DTW) techniques, this is possible [25]. However, one does not need such a method to calculate the GT of a given track. In this case, we only focused on the GT and ABD. Figure 8 shows how this system picks onset candidates from the onset curve and creates the beat positions by using periodicity information. Figure 9 shows the estimated time positions of beats at the beginning of a string quartet segment. As we can see, the system uses periodicity information to calculate beat positions even at places where no onsets are detected: in this specific part, the second violin and the viola are playing very quietly (and no onset is detected) and then a violin solo begins. Between the 6th and the 10th second of this track, there are strong onsets in the calculated onset curve. Their periodicity information is then used to fill the gap in the silent part of this recording, which is one of the advantages of the dynamic programming search system.
The disadvantage of such a beat tracking system is the adjustable default tempo: the algorithm searches for beat positions within a given interval, but there is no guarantee that true beat positions exist within the specified limits (also with respect to the tolerance parameter). The reference global tempo can be misleading if the recording is rhythmically unstable or the tempo changes significantly over time. A similar problem exists with the metric pulse. If the system detects 100 BPM as the GT and the reference is 50 BPM, it does not mean that the system is completely wrong. That is why we also calculated the ABD.
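How a dynamic programming search fills a silent gap from periodicity alone can be illustrated with a simplified tracker in the spirit of [12]. This is a sketch under stated assumptions, not the implementation used in this study. The function name `dp_beat_track` is hypothetical, as are the integer beat period given in frames, the predecessor window from two periods back to half a period back, and the rule that a new chain starts when no predecessor contributes a positive score. The transition penalty −tightness·log²(Δ/period) follows the form described by Ellis.

```python
import numpy as np


def dp_beat_track(onset_env, period, tightness=50.0):
    """Simplified DP beat search over an onset envelope (indices are frames).

    `period` is the expected beat period as an integer number of frames.
    """
    onset_env = np.asarray(onset_env, dtype=float)
    n = len(onset_env)
    score = onset_env.copy()               # best cumulative score ending at frame t
    backlink = np.full(n, -1, dtype=int)   # preceding beat frame, -1 = chain start
    # Candidate predecessors lie between two periods and half a period back.
    offsets = np.arange(-2 * period, -(period // 2) + 1)
    # Zero penalty for a spacing of exactly one period, growing with log-deviation.
    penalty = -tightness * np.log(-offsets / period) ** 2
    for t in range(n):
        prev = t + offsets
        valid = prev >= 0
        if not valid.any():
            continue
        cand = score[prev[valid]] + penalty[valid]
        best = int(np.argmax(cand))
        if cand[best] > 0:                 # otherwise start a new chain (assumption)
            score[t] += cand[best]
            backlink[t] = prev[valid][best]
    # Backtrace from the frame with the highest cumulative score.
    beats = [int(np.argmax(score))]
    while backlink[beats[-1]] >= 0:
        beats.append(int(backlink[beats[-1]]))
    return np.array(beats[::-1])


# Synthetic envelope: one onset every 20 frames, with two onsets removed
# to mimic a quiet passage in which no onset is detected.
period = 20
env = np.zeros(200)
env[10::period] = 1.0
env[[90, 110]] = 0.0
beats = dp_beat_track(env, period)
print(beats)
```

In this synthetic case the tracker places beats at frames 90 and 110 although the envelope is zero there, because the cumulative score of the periodic chain outweighs the missing local evidence.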

Dataset
First of all, we tested whether the TKEO improves the estimation of the GT in general. The GT is derived from the median of the differences between the time positions of consecutive beats throughout the whole analysed track. For this purpose, we used the SMC_MIREX database [26], which consists of different recordings, from classical pieces to guitar solos. The recordings are sampled at 44.1 kHz. Their annotations contain manually corrected beat time positions, which are used as a reference.
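The GT computation from a list of beat times can be sketched as 60 divided by the median inter-beat interval. The function name `global_tempo` and the synthetic beat grid are assumptions for illustration. The second call shows the metric-level ambiguity: tapping every other beat of the same track halves the reported tempo.

```python
import numpy as np


def global_tempo(beat_times):
    """Global tempo in BPM: 60 divided by the median inter-beat interval (s)."""
    intervals = np.diff(np.asarray(beat_times, dtype=float))
    return 60.0 / np.median(intervals)


# Synthetic beat grid: one beat every 0.5 s.
beat_times = np.arange(0.0, 10.0, 0.5)
gt_full = global_tempo(beat_times)        # every beat
gt_half = global_tempo(beat_times[::2])   # every other beat, same track
print(gt_full, gt_half)
```

Using the median rather than the mean keeps a few long or short intervals (for example a fermata) from shifting the estimate.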
Music by string quartets is very specific because the tempo can be more or less stable but the musical ornaments, intended gaps, fermatas, or other expressive musical attributes can be present. Every musician has her/his own style of agogic performance. If we define meaningful musical parts by choosing important musical motifs, we can create segments that could be processed separately.
The second dataset consists of 33 different interpretations of String Quartet No. 1 in E minor "From My Life", composed by the Czech composer Bedřich Smetana. We also included two interpretations played by an orchestra. We divided the first movement into six segments of musical motifs according to their musical meaning. The first movement consists of an introduction (Beg), exposition (A), coda (B), development (C), recapitulation (D) and the last coda (E). For every segment, we calculated the estimated average tempo (EAT), without any expressive elements or information about beat positions, using the physical length of the tracks and information about the rhythmic patterns in the sheet music. The EAT is used as a reference tempo for setting the default tempo parameter in the beat tracking system. The first page of the sheet music is provided as an example in Appendix A.
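The EAT of a segment follows from the number of quarter notes counted in the score and the physical length of the segment. A minimal sketch, assuming the duration is given in seconds and the quarter note is the beat unit. The function name and the numbers are hypothetical and are not values from the dataset.

```python
def estimated_average_tempo(n_quarter_notes, duration_s):
    """EAT in BPM: quarter notes per minute over the whole segment."""
    return 60.0 * n_quarter_notes / duration_s


# Hypothetical segment: 150 quarter notes played in 100 s.
eat = estimated_average_tempo(150, 100.0)
print(eat)
```

Such a value carries no information about local tempo changes, which is why it serves only as the default tempo for the beat tracker.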

Application
Beat tracking systems are used in music analysis software for complex tempo, timbre, dynamic or other music analysis. An example of such freeware is Sonic Visualiser [27]. Figure 10 shows an example of tempo analysis of the string quartet music from the second tested database. The first pane is the visualisation of the audio wave, the second one is the spectrogram and the last one is a layer of manually corrected beat positions. Beat positions were calculated automatically by the beat tracking system called BeatRoot [28] (Vamp plugin) and then corrected by trained ears. The green line shows how the tempo evolves in time, i.e., whether the audio track is locally slowing down or speeding up. The method proposed in this paper has not been developed as a Vamp plugin for Sonic Visualiser. Musicologists can then draw conclusions from the measurement results. An automated beat tracking system is able to reduce the time of analysis significantly. For example, if we measure the EAT of the first motif of the second database for each recording, we get interesting results. One of the general assumptions is that presently we usually play the same piece of classical music faster than we did before. Figure 11 shows that this assumption may not be correct. There is a trend (see the slope of the linear regression line based on the sum of squares): older recordings are on average at a faster pace. We do not have enough audio recordings to declare it as a fact, but the tendency is there. However, when we plot the EAT of the entire first movement (Figure 12), the tempo decrease is not so evident. Each black dot represents one interpretation and the blue line is a trend line. The sample from the year 1928 was an outlier and therefore we did not consider it in the regression analysis.

System Evaluation
During the analysis, we first used the reference dataset to determine the accuracy of the GT and ABD estimation. We computed the GT of each track by the proposed beat tracking method using both the proposed onset detection function (DS, default system) and the same onset detection function with the TKEO (TS, system with the TKEO). Then we compared the reference values (annotation of the dataset) of each tested track with the values estimated by the DS and the TS. The reference tempo was obtained as the number 60 (BPM definition) divided by the median of the time differences between consecutive beat time positions. Then we calculated the median (Me) and the mean value (x̄) of the time differences of consecutive beats in all recordings, and also only in those recordings in which the average was less than 1 s. The latter represents the ABD of tracks that were close to the reference tempo (some recordings showed a difference of more than approximately 20 BPM in the GT when tested; they were excluded from the extended ABD testing).
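The two evaluation quantities can be sketched as follows. The text does not spell out how reference and estimated beats are paired for the ABD, so the index-wise pairing used here (i-th reference beat against i-th estimated beat, truncated to the shorter sequence) is an assumption, and matching each reference beat to its nearest estimate would be an equally plausible reading. The function names and the beat times are hypothetical.

```python
import numpy as np


def average_beat_deviation(ref_beats, est_beats):
    """Mean and median absolute time difference (s) between paired beats.

    Assumption: beats are paired by index and the longer list is truncated.
    """
    ref = np.asarray(ref_beats, dtype=float)
    est = np.asarray(est_beats, dtype=float)
    n = min(len(ref), len(est))
    dev = np.abs(ref[:n] - est[:n])
    return float(dev.mean()), float(np.median(dev))


def tempo_deviation(ref_bpm, est_bpm):
    """Absolute GT deviation in BPM and as a percentage of the reference."""
    abs_dev = abs(est_bpm - ref_bpm)
    return abs_dev, 100.0 * abs_dev / ref_bpm


# Made-up beat times in seconds, for illustration only.
ref = [1.0, 2.0, 3.0, 4.0]
est = [1.1, 2.0, 2.8, 4.3, 5.0]
abd_mean, abd_median = average_beat_deviation(ref, est)
dev_bpm, dev_pct = tempo_deviation(80.0, 90.0)
print(abd_mean, abd_median, dev_bpm, dev_pct)
```

With index-wise pairing, a wrong metric level makes the deviation grow along the track, which is one reason to report the ABD separately for tracks whose average stays below 1 s.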
Next, we analysed the string quartet database. First, all 33 recordings were divided into six segments with a relatively steady tempo and then all motifs were tested by the TS and the DS to estimate the GT. We computed the reference EAT of all segments of each interpretation (Table 1) by calculating the number of quarter notes (Table 2) and dividing it by the time length of each recording. The complete table is in Table A1. Finally, the EAT and the computed GT were compared. The systems were implemented in Python (especially with the NumPy and LibROSA packages).

Results
Table 3 presents the results of the GT detection based on the first database for the first 30 analyzed tracks. The complete table is in Table A2. The average deviation from the reference tempo was 9.99 BPM (11.28%) in the case of the TS and 18.19 BPM (17.74%) in the case of the DS. The least accurate estimation was obtained for the recordings of a solo acoustic guitar. We also applied the t-test (Paired Two Sample for Means) to each system (compared to the reference). The p-value is 0.038 for the TS and 0.024 for the DS (α = 0.05). Next, Table 4 presents general results of the GT testing: median, mean, standard deviation, relative standard deviation and variance for each tested system and the deviations from the reference tempo. The mean value of the reference GT was 76.78 BPM, and the average computed GT was 83.75 BPM for the TS and 88.97 BPM for the DS. Table 5 shows the mean value and the median of the ABD testing for all analyzed tracks. The average difference between the consecutive beat time positions of the reference and those of the tested system was 2.30 s for the TS and 2.84 s for the DS. Table 6 shows the average of the arithmetic mean and the median of the time difference values of the recordings in which the ABD was less than 1 s. This means 11 recordings for the TS (37% of the tested database) and 9 recordings for the DS (30%). The TS detected the right metric pulse in more recordings than the DS. Average deviations from the reference beat positions were 0.39 s and 0.29 s for the TS and 0.95 s and 0.36 s for the DS, respectively.

Table 2 presents the length of each motif of the first movement in the second database and the corresponding number of quarter notes. Then, the EAT was calculated. Table 1 contains the results based on the EAT of all motifs of our second database: 33 different interpretations of String Quartet No. 1 in E minor "From My Life". Finally, Table 7 shows the difference between the estimated GT and the EAT for both proposed systems. The complete table is in Table A3. The average deviation is 6.42 BPM for the TS and 6.59 BPM for the DS. Due to the nature of the results of the second dataset, no further statistical processing of the values was used.

Figure 13 shows the differences between the reference GT and the calculated GT of the TS and the DS for the first database. The TS generally follows the reference tempo more accurately, mainly because it more often determined the correct metric pulse. The DS shows greater local deviations of the GT from the tested tracks.

Discussion
Generally, the newly proposed method provided some improvements on the reference database. We analyzed 30 tracks and the results are reported in Table 4. The results suggest that the TKEO can help the proposed beat tracking system to pick better onset candidates for the beat positions and to slightly improve the GT calculation. The difference was about 8 BPM on average for all tested recordings of the first database. However, many recordings reported the same estimated GT for both methods. Then, the ABD was calculated. We used the reference database with manually corrected beat positions to determine the accuracy of both systems. We did not use F-measures, but rather average differences between consecutive beats. The p-values show that there is a difference between the two systems. This gives us an idea of how close the beat tracking was to the reference positions. The system with the TKEO generally reported a lower ABD for all settings used. The results suggest that the TKEO pre-processing improved the accuracy of the beat tracking system. This does not apply to the general onset detection function. Onset detection accuracy was reduced in most cases. The only exception was the recording of the clarinet.
As far as the string quartet database is concerned, the results were again slightly in favour of the system based on TKEO. All 33 recordings of the second database were tested. The difference between the average deviation from the EAT of the TS and the DS was only 0.17 BPM, and therefore both systems had more or less the same detection accuracy. We chose such complex music to see how the enhancement would deal with a very difficult task. The actual usefulness of the application also depends on the settings of selected parameters, not just on the TKEO pre-processing.
The idea of using the TKEO in the pre-processing stage was to help the onset detection function to find more relevant onsets and therefore to enhance the beat tracking system in terms of choosing better candidates for beat positions. It reduced the number of insignificant onsets detected. Onset detection accuracy has usually been reduced, but the final beat detection output may be more stable because the algorithm chooses from fewer but more important onsets. This is useful for analyzing tracks where we expect a stable and non-agogic rhythm. We tested the effect of the TKEO to see how the output detection function would behave. We did not change parameters such as the tightness of the beat tracking system for each tested track; the correct setting (chosen for the particular piece of music) would yield better results for complex music analysis.
The limitation of this study is that the EAT in the string quartet database may serve as a reference value for the beat tracking system, but it is not the actual GT of a particular track, since we cannot include any expressive elements in it. It does not provide any information about beat positions or local tempo changes. The same applies to the reference global tempo. In enhanced interpretation analysis, we need to track all beat positions in the segment and compare them to the real beat positions. However, in this case, we analysed relatively stable tracks with no abrupt tempo changes. In the future, we would like to use this system to create a database of segmented string quartet music with additional information about manually corrected beat positions. The impact of the TKEO on audio recordings will be tested in more detail in our future work.
Cooperation between researchers and musicologists is a crucial part of such interdisciplinary projects and of the MIR field. Different base knowledge and tendencies can lead to mutual misunderstandings, but both sides could benefit greatly from each other. Projects like these are an important bridge between computer scientists, MIR researchers and musicologists.

Conclusions
This study introduces an enhancement of the conventional beat tracking system by adding the TKEO into the pre-processing stage. It briefly describes the onset detection function and the beat tracking method with its possible application. The onset detection accuracy decreased in most analyzed tracks, but the accuracy of the GT detection and the ABD detection increased.
The influence of the TKEO was tested on different recordings, and it was found that in the case of woodwind instruments the TKEO increased the onset detection accuracy. This phenomenon will be studied in our future work. We would like to focus on the possible applications of the TKEO to music recordings in general. The TKEO changes the magnitude of frequency components in a signal and acts as a filter. This could be the cause of the increased onset detection accuracy, e.g., for the clarinet example.
The estimation of the GT was improved in the reference database. The average deviation from the reference GT in the reference database is 9.99 BPM (11.28%), which improves on the conventional methodology, where the average deviation is 18.19 BPM (17.74%). The p-values indicate that there is a clear difference between the proposed systems. Both systems were also tested on the string quartet database. In this case, however, the results are not convincing. The proposed TS will be further used in the subsequent music analysis of the string quartet database. The aim is to create an automated system for capturing beat positions that are as close as possible to the actual beat positions in the recordings, even for complex music such as string quartets. In this way, it is possible to minimize the time required for manual processing and labelling. This study has a pilot character and provides some suggestions for improving the beat tracking system for music analysis.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Figure A1. The beginning of the first movement.