Guitar Chord Sensing and Recognition Using Multi-Task Learning and Physical Data Augmentation with Robotics

In recent years, many researchers have shown increasing interest in music information retrieval (MIR) applications, with automatic chord recognition being one of the popular tasks. Many studies have demonstrated considerable improvement in automatic chord recognition using deep learning based models. However, most of the existing models have focused on simple chord recognition, which classifies the root note with the major, minor, and seventh chords. Furthermore, in learning-based recognition, it is critical to collect a large amount of high-quality training data to achieve the desired performance. In this paper, we present a multi-task learning (MTL) model for a guitar chord recognition task, where the model is trained using a relatively large-vocabulary guitar chord dataset. To solve data scarcity issues, a physical data augmentation method that directly records the chord dataset from a robotic performer is employed. Deep learning based MTL is proposed to improve the performance of automatic chord recognition with the proposed physical data augmentation dataset. The proposed MTL model is compared with four baseline models and its corresponding single-task learning model using two types of datasets: a human dataset and a human dataset combined with the augmented dataset. The proposed methods outperform the baseline models, and the results show that most scores of the proposed multi-task learning model are better than those of the corresponding single-task learning model. The experimental results demonstrate that physical data augmentation is an effective method for increasing the dataset size for guitar chord recognition tasks.


Introduction
Automatic chord recognition (ACR) is a fundamental problem in the music information retrieval (MIR) field, wherein a chord is one of the key elements for understanding music. ACR remains a challenging problem because of the richness of the acoustic signals, broad types, and complexity of the music signal. The purpose of chord recognition is the automatic recognition of the chord progression in music recordings and its labeling in an appropriate form, such as A:maj, A:min, and B:sus. Popular applications of ACR can be found in music identification, music segmentation, and similar music recommendation systems. Recent ACR studies have shown that the chord recognition performance can be significantly improved using deep learning models, especially convolutional neural networks (CNNs) and recurrent neural networks (RNNs) [1,2]. Many ACR studies have focused on simple chord recognition [3,4].
However, recent works have addressed the large-vocabulary chord recognition problem. One of the challenges of large-vocabulary chord recognition is that the distribution of chord classes in a dataset is highly unbalanced; some classes of chords appear less frequently than other classes in a dataset, thereby rendering it difficult to recognize the chords.
An ACR system consists of three main components, namely feature extraction, classifiers, and chord sequence decoding. In the feature extraction part, an input audio signal is converted to its time-frequency representation using a short-time Fourier transform or frequency spectrogram. The log mel-spectrogram and constant Q transform (CQT) are widely used features in chord recognition [4,5]. Moreover, many research works have employed chroma representation in feature extraction [6,7].
Deep neural networks (DNNs) and CNNs are widely used for classification in audio music classification tasks [1,8]. In recent years, multi-task learning (MTL) models have been successfully applied in many applications of machine learning. MTL can be observed in situations where knowledge is often transferred among tasks that are related. For example, in the case of human experience, the skills required for playing the guitar and ukulele can help improve the playing ability of each instrument. Similar to human learning, it is effective to learn multiple learning tasks simultaneously as the knowledge in a task can be utilized by other related tasks. MTL aims to learn more than one related task at a time [9] from a dataset and achieve better performance and learning efficiency in each task. It can also be used to solve data scarcity problems. We hypothesize that learning chord roots and qualities individually can improve the performance of the chord recognition model.
In this study, we used two types of datasets, namely the GuitarSet [10] and a physically augmented dataset, which was first proposed in [11]. The GuitarSet dataset includes audio recordings from an acoustic guitar with various annotations, such as the chord label, tempo, and beat. The proposed physically augmented dataset includes audio recordings and chord labels only. A guitar-playing robot was developed to create the augmented dataset. Robots have the advantage that they can be built to any design. Therefore, the robot musician offers playability in terms of dynamic variation and playing speed.
In this paper, we address the relatively large-vocabulary chord recognition task by introducing a multi-task deep learning architecture that learns the chord roots and qualities individually and integrates the outputs into a symbolic label. Furthermore, we sought to generate an augmented dataset using robotics and prove its effectiveness via experimental evaluations. We also aimed to construct and publicly release (https://github.com/gerelmaab/Physically_augmented_guitar_chord_dataset) a large-vocabulary dataset of guitar chords, which is created by the physical data augmentation method.
The rest of this paper is organized as follows. In Section 2, related works are introduced. Section 3 describes the musical chord recognition system including physical data augmentation, chord recognition, and the dataset. MTL and the implementation of its architecture are described in Section 4. In Section 5, the evaluation metrics and set of experiments performed are detailed. We compare our proposed methods, including multi-task learning and physical data augmentation methods, to the baseline models, equivalent single-task architecture, and human dataset, and we evaluate the results. Section 6 presents the conclusions with further directions.

Related Works
ACR has been studied in various forms over many years. One of the earliest works is that of Fujishima et al., titled "Realtime Chord Recognition of Musical Sound: a System using Common Lisp Music" [12]. In this work, the author proposed the calculation of a 12-D chroma feature that is compared to a dictionary of binary chord templates. Since then, ACR has gained popularity in MIR. In recent years, CNN-based models have outperformed the conventional methods. In [8], the authors were the first to evaluate the CNN performance in ACR tasks. In this work, the authors used a CNN to classify a five-second tile of pitch spectra, achieving a performance competitive to the standard techniques. Along with the CNN, RNNs have also been applied to MIR tasks.
Boulanger-Lewandowski et al. studied the application of RNNs [13], wherein an RNN was used in the post-filtering phase for label sequence modeling. The RNNs were trained with the ground truth label sequences, where the optimal label sequences were typically derived using the beam-search algorithm. McFee et al. [5] proposed large-vocabulary chord recognition based on a structured representation of the chord qualities. They used an extensive class of chord qualities, including triads, diminished, and suspended chords. A CNN and an RNN can be used in one model, and this hybrid model is termed a convolutional recurrent neural network (CRNN). Kenwoo et al. [14] showed that the CRNN benefits from the flexibility of the recurrent layers in summarizing information along time, thus achieving the best performance among the compared CNN structures in a music classification task.
A recent overview of MTL can be found in [15]. An MTL model based on a CNN architecture has achieved favorable results in computer vision tasks [16] and natural language processing tasks [17]. MTL has also been successfully employed in drug discovery [18] and chronic disease prediction tasks [19]. MTL has been applied to various MIR tasks owing to the various attributes of music, such as the chord, key, fundamental frequency, or bass line. The chord recognition accuracy can be increased with improved chord root recognition based on MTL [20]. In that work, the authors used cepstral features and a neural network based multi-task learning network to train chord recognition and root note recognition jointly. In [21], the authors used a harmonic CQT feature and a fully convolutional neural network based multi-task learning framework to estimate various tasks, including melody, vocal, and bass line estimation. The results showed that the proposed network outperformed the corresponding single-task network.
To date, different types of robot musicians have been developed [22]. Several studies have developed robotic guitars and basses, such as the LEMUR GuitarBot [23], and MechBass [24], a four-string robotic bass guitar. Another example is Strumbot [25], a standalone six-stringed robotic guitar. Besides these robotic guitars, robots that can play acoustic and electric guitars have also been developed.

Musical Chord Recognition System
In this section, we describe the proposed system and physical data generation method, and the input and output representations of the chord recognition system.

System Flow
The overall pipeline of the proposed recognition system is illustrated in Figure 1. In the data-driven approach, the quality and richness of the dataset are essential aspects. In this study, we used the physical data generation method to enrich and create a more balanced chord dataset, which is proposed in [26]. First, we developed a guitar-playing robot system that plays the chords automatically (Section 3.2). Then, a robotic performer was used to create the guitar chord dataset. A comparison between the traditional augmentation and physical data augmentation methods is shown in Figure 1a.
The proposed solution for the recognition system involves a multi-task learning model based on a CNN and RNN. The MTL model predicts two tasks in parallel, as illustrated in Figure 1b. The input of the recognition system is derived from the physical augmented dataset. The proposed system uses two-dimensional (2D) representations instead of the original one-dimensional (1D) representation of audio. The audio signals are first pre-processed and decomposed into small frames, and frame-wise feature extraction is performed. We tailored the feature extraction strategies to produce suitable inputs for the network (Section 3.3.2).
In the training phase, two types of related tasks are trained simultaneously. The two tasks take the same single input. The given input audio has multiple labels (root note and quality). The root note and quality are estimated separately and then combined into one label. The output of the learning models is converted into label files to evaluate the performance of the architectures. We tested the recognition system only on a human dataset.

Physical Data Augmentation Guitar Playing Robot
The robot system comprises a string-pushing unit, picking unit, and linearly moving mechanical unit. A snapshot of the developed guitar-playing robot system is shown in Figure 2. The chassis of the robot was built with a T-slot aluminum extrusion, and the other supporting parts were made with acrylic and polylactic acid (PLA) filament. This type of robotic performer was also implemented in [26].

Pushing Unit
The guitar-playing robot pushes the strings at the correct positions and thereby generates various tones from the guitar. For the string-pushing unit, a four-line solenoid bar was installed, with six solenoids in each line to match the six strings; in total, there are 24 solenoids, so that any variation of a chord can be played, as shown in Figure 3. The solenoid bars are moved along the guitar neck, assisted by the single-axis robot. Figure 2b illustrates the implementation of the pushing unit, whose main body is made of acrylic. To produce at least 60 N from the push-pull solenoid, which is the required preload tension to generate an accurate tone, we used a double second-class lever, in which the mechanical advantage (MA) is always greater than 1 (Figure 2b). The law of levers is defined by Equation (1).
Hence, the output force (F_o2) is approximately three times the input force (F_i1). When sufficient force is applied to the pushing pillar, it pushes the string. A rubber band is used to move the pushing pillar up to its initial position.
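As a hedged illustration of the lever relation referred to as Equation (1), the standard law of the lever can be written for a single stage and for two stages in series. The arm lengths d and the stage indices 1 and 2 are notation introduced here for illustration; they are not taken from the original equation, and the actual arm lengths of the robot are not given in the text.

```latex
% Single lever: input force F_i applied at arm length d_i,
% output force F_o delivered at arm length d_o (both measured from the fulcrum)
F_i \, d_i = F_o \, d_o, \qquad
\mathrm{MA} = \frac{F_o}{F_i} = \frac{d_i}{d_o}

% Two levers in series, where the output of stage 1 drives stage 2 (F_{o1} = F_{i2})
\mathrm{MA}_{\mathrm{total}} = \frac{F_{o2}}{F_{i1}}
  = \frac{d_{i1}}{d_{o1}} \cdot \frac{d_{i2}}{d_{o2}}
```

In a second-class lever the load lies between the fulcrum and the effort, so d_i > d_o and MA > 1 for each stage; the overall mechanical advantage of the series arrangement is the product of the stage values.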

Picking Unit
The robot can pluck the string by a linear or swinging motion near the guitar sound hole to generate sound. Normally, a human would make a vertical movement to pluck a guitar string; however, instead of a vertical movement, our robot makes a horizontal movement, as illustrated in Figure 2c. The picking unit was implemented with push-pull solenoids similar to the pushing unit. Each element of the picking unit executes a controllable linear up and down motion using two solenoids that are deployed in opposite directions to each other. Six plectrums are attached to the picking unit, with one plectrum per string, and each guitar pick is designed to fit between two strings, as shown in Figure 2c. The plectrum holder was constructed using a 3D printer with a PLA filament. The picking units were installed above the guitar sound hole in a fixed position. Each picking component is independently controllable; therefore, the guitar can be strummed using many types of strokes. The proposed robot system can play various strokes, including the fast, slow, up, and down strokes, as illustrated in Figure 3b.

Linear Movement
Common guitar chords can generally be played below the 5th fret. The pushing unit moves along with the guitar neck, to facilitate a variety of chords. The solenoid carriage, attached to a linear moving robot, moves across the first 12 frets of a guitar. The solenoid carriage rides along the T-slot aluminum extrusion, as shown in Figure 2a.

Electronic Controller
The electronics that control all units comprise the controlling board, stepper motor drivers, solenoid driver circuits, and a power supply, which powers both the motors and the electronics. We constructed an electronic board to control the solenoids, with six solenoid controlling circuits on one board, as shown in Figure 2a. The solenoids operate at 12 V and are controlled by an on/off voltage. Four boards are used to control the 24 solenoids. The controlling board was connected to the computer via USB. The message from the computer to the controlling board contains chord information. Upon receipt of a message, the controlling board communicates with the stepper motor driver, driving the linear mechanism to a position corresponding to the received chord or note position. Subsequently, the corresponding solenoids are activated, and the correct chord is played.

Musical Chord Recognition Part
We designed a multi-task architecture that jointly learns root note recognition and chord quality recognition in the training phase. In musical chord recognition, feature extraction operates over the frames. The pre-processed input is passed through two identical CRNNs to generate the root as the first output and quality as the second output. Each output of the sub-networks is then combined to obtain a single simple chord annotation form.

Input Representation
Many research works have shown that the time versus frequency representation is most suited for learning-based approaches in music [2,14]. We used the CQT as an input representation of the models. The CQT is a logarithmic representation of frequency, which can be easily calculated from a raw audio [27]. The main advantage of the CQT representation is that higher-frequency resolution is obtained at lower frequencies and higher time resolution is achievable at higher frequencies.
Abdel-Hamid et al. used the Mel-frequency spectral coefficients (MFSC) feature for speech recognition tasks along with their delta and delta-delta to describe the acoustic energy distribution in each of several frequency bands [28]. This representation is similar to the red, green, and blue channels of an image. These dynamic (delta) and acceleration (delta-delta) features are very efficient in speech recognition tasks [29]. Inspired by the above, the CQT, along with the corresponding delta and delta-delta features (first and second temporal derivatives of the CQT), can be applied to the chord recognition task. We expect the delta and delta-delta features to improve performance at the chord boundaries, as the delta features take large values when the sound changes. We experimentally demonstrate the effectiveness of the delta and delta-delta features in Section 5.1. The block diagram depicting all the pre-processing steps is illustrated in Figure 4. The CQT feature was computed with Librosa [30] using a hop size of 4096 samples, resulting in a frame rate of approximately 10.8 Hz. In an acoustic guitar, the lowest note is E2, which has a fundamental frequency of 82 Hz, and the highest note is F6, which has a fundamental frequency of 1397 Hz. Therefore, we set the CQT parameters to span five octaves, starting at the A1 note at about 55 Hz with 12 bins per octave. The preprocessing parameters are listed in Table 1. The output of this processing step has dimensions of one frame × 60 bins.
The neighboring frames of the input representation can be expected to contain similar content, as the chords will not change on a frame-by-frame basis. Through systematic experiments on the validation folds (Section 5.2), we found that the root and quality scores stabilize from a context window of ±0.45 s. In this study, the input to the proposed model for each target frame consists of 11 consecutive frames: the target frame and the ±5 frames around it (±0.45 s of audio). This group of frames is termed a superframe and corresponds to an audio clip of approximately 1.0 s. This operation is applied to all the frames. The output of the pre-processing step is a 3 × 11 × 60 sized array, where 60 represents the number of CQT bins, 11 is the total number of frames, and 3 represents the CQT feature along with its two derivatives.
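The CQT settings above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the values stated in the text (44,100 Hz sampling rate, 4096-sample hop, 55 Hz starting frequency, 12 bins per octave, five octaves); the variable and function names are introduced here for illustration, and the geometric bin spacing is the standard CQT definition rather than a detail taken from the paper.

```python
# Sketch of the CQT geometry described above (all values from the text).
SAMPLE_RATE = 44_100      # Hz, sampling rate of the recordings
HOP_LENGTH = 4_096        # samples between consecutive frames
F_MIN = 55.0              # Hz, A1, centre frequency of the lowest bin
BINS_PER_OCTAVE = 12
N_OCTAVES = 5

frame_rate = SAMPLE_RATE / HOP_LENGTH       # frames per second
n_bins = BINS_PER_OCTAVE * N_OCTAVES        # CQT bins per frame


def bin_frequency(k: int) -> float:
    """Centre frequency of CQT bin k under geometric (constant-Q) spacing."""
    return F_MIN * 2.0 ** (k / BINS_PER_OCTAVE)


lowest, highest = bin_frequency(0), bin_frequency(n_bins - 1)
print(f"frame rate: {frame_rate:.2f} Hz, bins per frame: {n_bins}")
print(f"bin range: {lowest:.1f} Hz to {highest:.1f} Hz")
```

Under these settings the 60 bins cover both the E2 and F6 fundamentals quoted above, which is consistent with the choice of a five-octave span.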
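The superframe construction can be sketched as follows. This is an illustrative sketch rather than the authors' code: the helper names are hypothetical, the delta channels are filled with zeros as placeholders, and the edge-padding rule for frames near the start or end of a recording is an assumption, because the text does not state how boundaries are handled.

```python
from typing import List

Frame = List[float]   # one CQT frame: 60 bin magnitudes
CONTEXT = 5           # frames on each side of the target frame


def superframe(frames: List[Frame], t: int, context: int = CONTEXT) -> List[Frame]:
    """Return the 2*context+1 frames centred on index t.

    Indices outside the recording are clamped to the first or last frame
    (edge padding). This padding rule is an assumption, not from the paper.
    """
    last = len(frames) - 1
    return [frames[min(max(i, 0), last)] for i in range(t - context, t + context + 1)]


def stack_channels(cqt, delta, delta2, t):
    """Stack CQT, delta and delta-delta superframes into a 3 x 11 x 60 array."""
    return [superframe(channel, t) for channel in (cqt, delta, delta2)]


# Toy data: 20 frames of 60 bins; each CQT frame is filled with its own index.
cqt = [[float(i)] * 60 for i in range(20)]
zeros = [[0.0] * 60 for _ in range(20)]   # placeholder derivative channels
x = stack_channels(cqt, zeros, zeros, t=2)
shape = (len(x), len(x[0]), len(x[0][0]))
print(shape)                              # (3, 11, 60)
print([frame[0] for frame in x[0]])       # frame indices used, clamped at the start
```

With a 4096-sample hop at 44,100 Hz, 11 frames correspond to roughly 1.02 s of audio, which matches the approximately 1.0 s clip length quoted above.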

Output Representation
In the dataset annotation, a chord description is represented in a form such as G#:maj6(2,b5,5)/1. To formulate chord recognition as a classification task, we define a mapping of the dataset from chord description to chord vocabulary. First, we discard the chord inversions and suppressed or additional notes, for example: D#:sus2(7)/1 → D#:sus2. Next, we convert chord labels with a short duration to analogous labels. We use ten types of quality, namely maj, min, aug, maj7, min7, 7, dim7, hdim7, sus2, and sus4. The final vocabulary contains 98 classes. We incorporate a conversion over a chord as follows: For the labeling, a single label is estimated for each superframe. The chord label of the middle frame is used as the label of the audio context clip, as illustrated in Figure 5.
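The first mapping step (discarding inversions and added or suppressed notes) can be sketched with simple string handling. The function names are hypothetical, and the sketch covers only that first step: how qualities outside the ten-type vocabulary (such as maj6) or short-duration labels are converted is not recoverable from the text, so such labels are only flagged here.

```python
import re

QUALITIES = {"maj", "min", "aug", "maj7", "min7", "7", "dim7", "hdim7", "sus2", "sus4"}


def simplify_label(label: str) -> str:
    """Drop the inversion ('/bass') and the parenthesised interval list."""
    label = label.split("/", 1)[0]          # remove the inversion
    return re.sub(r"\(.*?\)", "", label)    # remove added/suppressed notes


def split_label(label: str):
    """Split a simplified 'root:quality' label into its two parts."""
    root, _, quality = simplify_label(label).partition(":")
    return root, quality


print(simplify_label("D#:sus2(7)/1"))        # D#:sus2
root, quality = split_label("G#:maj6(2,b5,5)/1")
print(root, quality, quality in QUALITIES)   # G# maj6 False
```

Splitting the label into a root and a quality in this way also yields the two targets that the multi-task model predicts separately.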

Datasets
The GuitarSet dataset: The GuitarSet contains musical audio files and the corresponding chord annotations, which follow the annotation method of [31]. Three chord progressions are paired with five different genres, including rock, jazz, singer-songwriter, funk, and bossa nova. Six professional guitarists each played 30 min of audio in their own style, producing 180 audio recordings with a total duration of 3 h. There are two different chord annotations, termed the "instructed" and "performed" chord annotations.
The "instructed" chord annotation follows the given chord sheet. In contrast, the guitarists adapted the given chords to their playing style, and the chords actually played are recorded in the "performed" annotation. These two types of annotations are not necessarily the same, although the audio file is the same. Because the players replaced some instructed chords with different chords, we used the "performed" chord annotation.
The augmented dataset: The augmented dataset was recorded directly from the guitar playing robot. This dataset consists of 12 root notes and 10 types of chord quality in a total of 97 classes of chords. Each chord was played individually with five types of stroking patterns, including D DU DU DU D, D DU UDU, D D DU UDD, D D DU, and simple fingerstyle, where D represents a downstroke, U represents an upstroke, and a space represents the gap time between the strokes. The recording environment was a sound isolation chamber. The microphone was placed near the guitar sound hole. Each specific chord was recorded for approximately 40-45 s, indicating that the created augmented dataset was evenly distributed over all the chord types. This dataset was recorded in an identical manner as the human dataset, in the WAV format and sampled at fs = 44,100 Hz. We annotated the chords manually using the chord annotation method proposed by [31].
The GuitarSet dataset is unbalanced over the chords; e.g., the C#:aug and C#:sus4 chords have the shortest duration of approximately 1.2 s, whereas the C#:maj chord has the longest duration of approximately 380 s. The distribution of the chord types of the GuitarSet dataset is shown in Figure 6. To address this data scarcity and class imbalance problem, we propose a multi-task learning network. The proposed multi-task learning network trains the chord root and quality separately. Observing the chord distribution based on the chord root and quality in Tables 2 and 3, respectively, it can be seen that the model has considerably more information on each chord in terms of root and quality: the root and quality classes have more training examples than the individual chord classes. However, the chord qualities are also unbalanced; dim7 accounts for approximately 0.07% of the chord quality distribution, which is considerably smaller than the share of the major chords, approximately 52.58%. In contrast, the root note is relatively evenly distributed over the GuitarSet dataset. In the augmented dataset, all chord types are at least approximately 45 s long.
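The effect of splitting labels into roots and qualities can be illustrated with a small count. In the sketch below, the three C# durations are the values quoted above; the D entries are hypothetical numbers added only so that pooling across roots has something to pool, and they do not come from the dataset.

```python
from collections import defaultdict

# Durations in seconds. The three C# entries come from the text above;
# the D entries are hypothetical values added only for illustration.
durations = {
    "C#:aug": 1.2, "C#:sus4": 1.2, "C#:maj": 380.0,
    "D:aug": 5.0, "D:sus4": 8.0, "D:maj": 200.0,
}

per_root = defaultdict(float)
per_quality = defaultdict(float)
for label, seconds in durations.items():
    root, quality = label.split(":")
    per_root[root] += seconds        # pool every quality of the same root
    per_quality[quality] += seconds  # pool every root of the same quality

print(min(durations.values()))       # rarest full chord class
print(min(per_root.values()))        # rarest root class
print(min(per_quality.values()))     # rarest quality class
```

Even in this toy case the rarest root and quality classes have more audio than the rarest full chord class, which is the motivation for learning the two targets separately.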

Proposed Multi-Task Learning
This section details single-task learning and the structure of the proposed MTL model.

Conventional Single-Task Learning
The single-task learning network contains seven convolutional layers, two recurrent layers, and two fully connected (FC) layers. The architecture of the proposed convolutional recurrent neural network (CRNN) is shown in Figure 7a. The proposed architecture contains seven convolutional layers with a 3 × 3 receptive field and a rectified linear unit (ReLU) activation function, followed by batch normalization [32]. Five max-pooling layers of various sizes are employed, followed by a dropout [33] with 0.5 probability. We reshaped the output of the convolutional layer to make it compatible with the input size of the recurrent layer. A gated recurrent unit (GRU) with 32 neurons was used as the recurrent layer. The output of the recurrent layer was connected to the FC layer with 64 neurons, and the "softmax" activation function was used for output classification. The single-task learning model is used for comparison in the subsequent experiments.
Training deep learning systems usually requires large and balanced datasets to train a good system and learn accurate parameters. However, in some applications such as medical imaging or where there is a relative lack of data (certain classes of datasets are rare relative to other events, objects, or classes) this requirement cannot be satisfied. In a large-vocabulary chord dataset, the majority class examples, such as major and minor, far outnumber the minority class examples such as suspended or augmented. In these cases, the single-task learning model lacks knowledge of some classes, and accurate learning is difficult. However, multi-task learning is a useful approach in cases where useful information can be derived from other related learning tasks to handle the data sparsity problem.

Multi-Task Learning
The proposed model learns a mapping between an audio representation, such as the CQT, and the root notes and qualities. Essentially, the model can be viewed as a multi-task learning model. From the pre-processed input, the proposed model computes the root and quality outputs directly. The model is trained with a multi-task learning method using two cost functions. In each cost function, the cross entropy (CE) between the ground truth and predicted labels is used. In summary, the root note and quality are modeled jointly with a shared network by the proposed model. This learning method is designed for chord recognition.
In the case where the labeled data for one of the tasks are scarce, MTL can draw on the labeled data of the related task. In our dataset, the training examples of the suspended and augmented qualities are scarce compared with those of the other quality classes. As the two tasks have a different number of outputs, the potentially shareable layers are the middle layers of the model, and the final layers have different dimensionality. In the proposed model, one sub-network learns 12 root notes, while the other sub-network learns 10 types of qualities: major, minor, augmented, major seventh, minor seventh, dominant seventh, diminished seventh, half-diminished seventh, and the two suspended types (sus2 and sus4). The outputs of the two networks are then combined to obtain the "root:quality" annotation form, which was proposed in [31]. In this approach, the number of output classes reduces to 12 and 10 for the root and quality, respectively, compared with the single-task learning model, with no change in the number of training examples.
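The two cost functions can be illustrated for a single frame as follows. This is a minimal sketch, not the authors' training code: the function names are hypothetical, and combining the two cross-entropy terms as an equally weighted sum is an assumption, since the text does not state how the two costs are weighted.

```python
import math


def cross_entropy(probs, target_index):
    """Categorical cross entropy for one frame with a one-hot target."""
    return -math.log(max(probs[target_index], 1e-12))


def joint_loss(root_probs, root_target, quality_probs, quality_target,
               w_root=1.0, w_quality=1.0):
    """Weighted sum of the two task losses (equal weights are an assumption)."""
    return (w_root * cross_entropy(root_probs, root_target)
            + w_quality * cross_entropy(quality_probs, quality_target))


# Toy softmax outputs: 12 root classes and 10 quality classes.
root_probs = [0.5] + [0.5 / 11] * 11
quality_probs = [0.25] + [0.75 / 9] * 9
loss = joint_loss(root_probs, 0, quality_probs, 0)
print(round(loss, 4))   # 2.0794, i.e. -ln(0.5) - ln(0.25)
```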
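Combining the two outputs into a "root:quality" label can be sketched as an argmax over each output vector. The class orderings and the sharp-only root spelling below are assumptions made for illustration; the text specifies only the number of classes per task and the ten quality names.

```python
# Assumed class orderings; only the class counts and quality names come from the text.
ROOTS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
QUALITIES = ["maj", "min", "aug", "maj7", "min7", "7", "dim7", "hdim7", "sus2", "sus4"]


def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=values.__getitem__)


def combine(root_probs, quality_probs):
    """Merge the two sub-network outputs into one 'root:quality' label."""
    return f"{ROOTS[argmax(root_probs)]}:{QUALITIES[argmax(quality_probs)]}"


# Toy outputs peaking at root index 3 and quality index 8.
root_probs = [0.01] * 12
root_probs[3] = 0.89
quality_probs = [0.02] * 10
quality_probs[8] = 0.82
label = combine(root_probs, quality_probs)
print(label)   # D#:sus2
```

Two output layers of 12 and 10 units can express 120 root and quality combinations, whereas a single-task classifier needs one output unit per chord class in the vocabulary.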

Multi-Task Learning Architecture
We combined MTL with the CNN and RNN framework by sharing some layers between the two related tasks. The input features are passed through two identical sub-networks, where each sub-network architecture is the same as a single-task learning model except for the output layer. The two identical architectures perform root note and quality recognition. The output layers have a different number of outputs. The output of the first task is 12 because there are twelve music notes. The output of the second task is 10 because ten types of chord qualities are employed. The proposed MTL model is shown in Figure 7b.
The MTL model has one shared layer and uses soft parameter sharing in the FC layers of the two sub-networks. The output of the FC layer of the root learning sub-network is concatenated to the output of the FC layer of the quality learning sub-network. Therefore, the classification of quality has some information of the chord root.

Training
In the experiment, 6-fold cross-validation was applied to the entire dataset. In each fold, the recordings of five players were used for training, while the recordings of the remaining player were used for testing. First, the complete network was trained only with the human dataset (GuitarSet dataset). Subsequently, the physically augmented dataset was combined with the training subset of the human dataset, and the same experiment as the first was performed. In the second experiment, the test subset consisted only of the human-played chords, as in the first experiment.
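The fold construction described above amounts to leave-one-player-out cross-validation, which can be sketched as follows. The player identifiers and file names are hypothetical placeholders; the point of the sketch is that robot recordings are only ever added to the training side, so every test set contains human recordings only.

```python
PLAYERS = ["00", "01", "02", "03", "04", "05"]   # hypothetical player IDs


def player_folds(players):
    """Yield (training players, held-out player) for each of the six folds."""
    for held_out in players:
        yield [p for p in players if p != held_out], held_out


def build_training_set(human_by_player, augmented, train_players, use_augmented):
    """Collect training files; robot recordings are added to training only."""
    files = [f for p in train_players for f in human_by_player[p]]
    if use_augmented:
        files = files + list(augmented)
    return files


# Toy file lists (hypothetical names).
human_by_player = {p: [f"player{p}_take.wav"] for p in PLAYERS}
augmented = ["robot_C_maj.wav"]

folds = list(player_folds(PLAYERS))
train_players, test_player = folds[0]
train_files = build_training_set(human_by_player, augmented, train_players, True)
test_files = human_by_player[test_player]
print(len(folds), len(train_files), test_files)
```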
In the training phase, a frame-based strategy was used, wherein each song was divided into frames, and each frame was treated as an independent training sample. On average, each song was divided into approximately 250-350 frames, resulting in approximately 32,500 and 75,000 training samples in the human and augmented datasets, respectively, and approximately 9800 test samples.
Because neighboring frames contain similar content and chords do not change on a frame-by-frame basis, the models were trained on superframes of 11 frames (approximately 1 s of audio). The networks were implemented using the Keras library running on top of TensorFlow. The Adam optimizer [34] was employed to train the networks for up to 300 epochs with a learning rate of 0.001 and a batch size of 100. Training was stopped early if the validation loss did not improve for 10 epochs.
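The stopping rule can be sketched as a patience counter over the validation loss. The function below is a hypothetical helper that follows the usual patience logic (stop once the loss has failed to improve for the given number of consecutive epochs); it is not the callback actually used in the experiments, and the loss values are toy numbers.

```python
def stopping_epoch(val_losses, patience=10, max_epochs=300):
    """Return the 1-based epoch at which training stops."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, waited = loss, 0      # improvement: reset the counter
        else:
            waited += 1                 # no improvement this epoch
            if waited >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


# Toy curve: the loss improves for three epochs and then plateaus.
losses = [1.0, 0.8, 0.7] + [0.75] * 20
stop = stopping_epoch(losses)
print(stop)   # 13: ten epochs without improvement after the best epoch (3)
```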

Experimental Result
Weighted chord symbol recall (WCSR) was used for evaluation. The WCSR score can be computed using Equation (2), where t_c is the duration of the correctly classified chord segments and t_a is the duration of the entire chord segments.
The WCSR score was computed with mir_eval [35]. The root, maj-min, thirds, triads, sevenths, tetrads, and MIREX scores were used to compare the results. In mir_eval, the root only compares chord root; maj-min compares the 12 major, 12 minor, and 'no chord' classes; thirds compare root and thirds; triads compare root, third, and fifth; sevenths compare root, thirds, fifths, and sevenths; tetrads compare all intervals, and MIREX compares at least three correct notes. The output of the learning models was converted into label files to calculate the scores of mir_eval.
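From the definitions of the two durations, the WCSR can be read as the ratio of correctly labeled duration to total duration, which the following sketch computes over a list of segments. This is a simplified illustration and not the mir_eval implementation: it counts a segment as correct only on an exact label match, whereas each mir_eval score (root, thirds, triads, and so on) applies its own comparison rule. The segment data are toy values.

```python
def wcsr(segments):
    """Weighted chord symbol recall over (start, end, reference, estimate) segments.

    A segment counts as correct only if the two labels match exactly; this is
    a simplification of the metric-specific comparisons used in mir_eval.
    """
    total = sum(end - start for start, end, _, _ in segments)
    correct = sum(end - start for start, end, ref, est in segments if ref == est)
    return correct / total if total > 0 else 0.0


# Toy example: 4 s of audio, of which 3 s are labeled correctly.
segments = [
    (0.0, 2.0, "C:maj", "C:maj"),
    (2.0, 3.0, "G:7", "G:maj"),
    (3.0, 4.0, "A:min", "A:min"),
]
score = wcsr(segments)
print(score)   # 0.75
```

Because each segment is weighted by its duration, a long, correctly recognized chord contributes more to the score than a short one.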

Effects of CQT with Delta, Delta-Delta Features
In this experiment, cross-validation was not employed. For both models, the GuitarSet dataset was used for training and testing. Five players' recordings were used as the training dataset, and the other player's recordings were used as the test dataset. The CNN model is a VGG-style [36] CNN. Table 4 shows the advantage of using the delta and delta-delta features, which significantly improved the accuracy for both the models and the datasets.

Effects of Audio Context Size
Here, a VGG-style CNN architecture was used to evaluate the context size. We determined the optimal amount of context for each score experimentally using the one-fold experiment, as shown in Figure 8. The results indicate that the scores were nearly stable from 1 s onwards, which has a context size of 5. Therefore, we chose a 1 s audio context for all the experiments.

Effects of Traditional Augmentation and Physical Augmentation Methods
In this experiment, we used the simple CNN model with the CQT and delta and delta-delta features, along with a small dataset. In the data augmentation method, time-stretching data augmentation was used, which slows down or speeds up the audio sample while keeping the pitch unchanged. Each sample was time-stretched using four randomly selected factors: 0.8, 0.9, 1.1, and 1.2. Four types of metrics were used for comparison because a small dataset was employed. Table 5 shows the advantages of using the physical data augmentation method, which results in improved accuracy.

Overall Performance
Here, we compare the performance of the proposed and baseline methods on the same datasets. It is not simple to define a particular method as a state-of-the-art method because the chord recognition methods were evaluated on different datasets. Among the different models, we chose four baseline methods, namely the deep neural network (DNN) model, VGG-style CNN model, CNN [4] model, and combination of a CNN and gated recurrent unit, termed CR2 in [5]. The DNN has two hidden layers with 128 and 64 neurons, and an output layer. The softmax activation function is used for classification. In the VGG-style CNN architecture, four convolutional layers with a 3 × 3 filter are followed by a max-pooling layer of size 2 × 1. After this, two convolutional layers and max-pooling layers of size 2 × 2 are connected. This is followed by one convolutional layer and max-pooling layer of size 1 × 2. A FC layer with a softmax activation function is used for classification. Dropout is applied after each max-pooling and FC layer. The CNN [4] includes eight convolutional layers, two max-pooling layers, and an average pooling layer instead of a flatten layer. Dropout and batch normalization [32] are used to prevent overfitting and speed up training convergence, respectively. The CR2 model [5] consists of two convolutional layers and two bidirectional GRU units. The baseline models were re-implemented to suit our input feature representation.
The experimental results of the baseline models and proposed methods are presented in Table 6. The best value obtained for each metric is highlighted in bold. Among the baseline models, the VGG-CNN model, which is baseline model 2, achieved the best results in terms of most metrics. When the augmented dataset was used, the quality scores increased except in the case of the CR2 model. When models with single-task learning were included, the best performance was obtained in terms of all the metrics.
When comparing the proposed methods, the MTL model trained with the augmented dataset, which was proposed method 3, obtained better performance in terms of four metrics, namely the root, thirds, tetrads, and maj-min. The other metrics showed comparable performance with the proposed methods. The STL performed better for the triad, sevenths, and MIREX. The proposed method achieved better performance than the baseline models and single-task learning model. The MTL models achieved better scores for chord root and triad recognition.

Discussion
Based on the above results, a CNN proved more effective for the chord recognition task than the traditional DNN learning method. The proposed MTL model based on the CNN and RNN networks showed good performance for the large-vocabulary guitar chord recognition task. The presented results demonstrate that the performance of the proposed method was superior to that of the baseline models and the single-task learning model. We demonstrated that the static feature along with the dynamic (delta) and acceleration (delta-delta) features are useful for the guitar chord recognition task. In addition, the physical data augmentation method showed acceptable performance in the guitar chord recognition task. This physical data augmentation method can be used for any musical instrument. Using a robotic performer, we can create a large and balanced dataset, which also includes sufficient training examples of rarely played chords such as the suspended and eleventh chords. However, there are some limitations, including the hardware implementation and the complexity of the robot design.

Conclusions
In this paper, we applied a multi-task learning model based on a CNN and RNN to a relatively large-vocabulary guitar chord recognition task and demonstrated the physical data augmentation method and its utility. The experimental results demonstrated that the multi-task learning model outperformed the baseline models and achieved better performance than its equivalent single-task model. In addition, we found that an augmented dataset created by a robot is also effective for guitar chord recognition. Both dynamic and acceleration features were used as the model's input features, resulting in good performance. The models trained on the human dataset combined with the augmented dataset achieved improved performance in terms of most of the metrics. An advantage of multi-task learning for chord recognition is that its training speed is faster than that of single-task learning.

Conflicts of Interest:
The authors have no conflicts of interest to declare.