1. Introduction
The automatic detection and classification of cardiac abnormalities from 12-lead ECG signals has been an area of research interest for a long time [1]. Methods have ranged from medical decision-support systems to statistical approaches, and from simple neural network architectures to more sophisticated methods based on deep neural networks [1,2,3]. Much research has focused on deep learning applied to medical images [4], time series classification [5], and object detection [6]. In [7], a deep recurrent neural network approach was developed and tested for the classification of four types of atrial fibrillation (AF) severity based on 21 features. Continuous wavelet transforms (CWTs) have been used for ECG signal processing in several studies; for example, in [8] the CWT was considered for multiscale parameter estimation for delineation of the fiducial points of P-QRS-T waves.
Recent examples of diagnostic 12-lead ECG classification have been reported. In [3], a deep neural network was used for the classification of six diagnostic classes, whereas the study in [9] analyzed 12-lead ECG signals with deep learning for the classification of four types of arrhythmias. A deep learning neural network model was tested on a database of 6788 12-lead ECG records for the identification of nine diagnostic classes [10].
Consequently, many algorithms may be used to identify cardiac abnormalities. However, most of these methods are trained, tested or developed in relatively small or homogeneous databases, and most of them focus on identifying a small number of cardiac arrhythmias that do not represent the full complexity of ECG classifications [11]. After a long series of interesting annual challenges, the PhysioNet/Computing in Cardiology Challenge 2020 provided the opportunity to address these problems, considering an extended set of diagnostic classes and a set of learning/testing ECG records belonging to different databases [11,12,13].
The main objective of this study was to test two different techniques for the automatic classification of ECG signals with active participation in the PhysioNet/Computing in Cardiology Challenge 2020. In particular, the classical rule-based system method, as well as a more sophisticated technique based on direct learning from ECG raw data through deep learning architectures, are explored and compared in the same framework.
3. Results and Discussion
The score indices of the first and second phases of the Challenge (validation scores) are defined and reported in [11]. In particular, based on the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), with precision = TP/(TP + FP) and recall = TP/(TP + FN), the following indices were considered:
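- F_1: the F-measure, which is the harmonic mean of precision and recall:
  $$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} = \frac{2\,TP}{2\,TP + FP + FN}$$
- F_2: a more general F-measure, which weighs recall more highly than precision:
  $$F_2 = \frac{5 \cdot \mathrm{precision} \cdot \mathrm{recall}}{4 \cdot \mathrm{precision} + \mathrm{recall}} = \frac{5\,TP}{5\,TP + FP + 4\,FN}$$
- AUROC: area under the receiver operating characteristic (ROC) curve.
- AUPRC: area under the precision-recall curve.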
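As a minimal sketch of how the two F-measures can be computed from the confusion counts defined above, the following Python function uses the standard F_beta form (beta = 1 for F_1, beta = 2 for F_2). The counts are hypothetical, and returning 0.0 for an empty denominator is an assumption; the official Challenge implementation described in [11] may differ in detail.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP/(TP+FP), recall = TP/(TP+FN); 0.0 when undefined (assumed convention)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Standard F_beta computed directly from confusion counts."""
    b2 = beta * beta
    denom = (1 + b2) * tp + fp + b2 * fn
    return (1 + b2) * tp / denom if denom > 0 else 0.0


# Hypothetical counts for a single diagnostic class (not taken from the paper).
tp, fp, fn = 80, 20, 40
precision, recall = precision_recall(tp, fp, fn)
f1 = f_beta(tp, fp, fn, beta=1.0)  # harmonic mean of precision and recall
f2 = f_beta(tp, fp, fn, beta=2.0)  # recall weighted more heavily than precision

print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f} F2={f2:.3f}")
```

With these counts recall is lower than precision, so F_2 comes out lower than F_1, which illustrates the heavier penalty that F_2 places on false negatives.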
Our team, named ‘Gio_Ivo’, participated successfully in the unofficial and official phases of the Challenge.
In a preliminary phase, the learning process was based only on the CPSC database, consisting of 6877 ECG records with only nine possible diagnostic classes, with a consequent simplification both of the rule-based method and the architecture of the CNN. 
Table 5 displays the cross-validation indices of the tested algorithms in this preliminary dataset. 
In the official Challenge phase, the entire learning set of 43,101 ECG records was considered, and the number of diagnostic classes increased to 110. The Challenge scoring system was essentially concentrated on a subset of 27 classes, considering the relevant diagnostic classes of clinical interest. A particular scoring system was defined by the Challenge to cope with the fact that not all misdiagnosed results are equally bad. In addition, a subset of 24 classes was activated in the identification process, considering three pairs of equivalent classes (CRBBB and RBBB, PAC and SVPB, and PVC and VEB). During this official phase, the submissions were tested on the validation set of 6630 (1463 + 5167) records. To increase the efficiency of the learning process, the learning subsets LS_N1000 (16,002 records), LS_N600 (11,210 records), and LS_N1500 (20,044 records) were used in the testing procedures (Table 3). 
Table 2 shows the weighted distribution of the learning set LS_N1000 in the 24 diagnostic classes considered.
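The reduction from 27 scored classes to 24 active classes through the three equivalent pairs can be illustrated with a small label-mapping sketch. The direction of the mapping (which label of each pair is kept) and the example record are assumptions made for illustration only.

```python
# Assumed direction: the first label of each equivalent pair is mapped onto the second.
EQUIVALENT = {"CRBBB": "RBBB", "PAC": "SVPB", "PVC": "VEB"}


def merge_equivalent(labels: list[str]) -> list[str]:
    """Map each label to its canonical class and drop duplicates, preserving order."""
    merged: list[str] = []
    for label in labels:
        canonical = EQUIVALENT.get(label, label)
        if canonical not in merged:
            merged.append(canonical)
    return merged


# Hypothetical multi-label annotation of one ECG record.
record_labels = ["AF", "CRBBB", "RBBB", "PVC"]
merged_labels = merge_equivalent(record_labels)  # ['AF', 'RBBB', 'VEB']

n_scored = 27
n_active = n_scored - len(EQUIVALENT)  # 24 classes remain after merging
print(merged_labels, n_active)
```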
The deep learning process was performed and tested using three-fold cross-validation. This choice was mainly due to the CPU time required for the training. For example, a training run on a single fold took from 15 to 24 h of CPU time. However, in the submitted algorithms, several platform-related problems slowed the training process; consequently, the learning was performed on a single fold to ensure an acceptable duration of the learning process and a more convenient feedback phase.
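A three-fold split of this kind can be sketched as follows. This is a plain shuffled partition of record indices using only the Python standard library; the helper name `k_fold_indices` is hypothetical, and the sketch does not attempt any stratification by diagnostic class, which the study may or may not have used.

```python
import random


def k_fold_indices(n_records: int, k: int = 3, seed: int = 0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Interleaved slicing gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


# 16,002 is the size of the LS_N1000 learning subset reported in the text.
splits = list(k_fold_indices(16002, k=3))
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold_number, len(train_idx), len(test_idx))
```

Training on a single fold, as done for the submitted algorithms, corresponds to using only the first `(train_idx, test_idx)` pair, which cuts the training cost to roughly one third of the full cross-validation.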
Table 6 reports the official Challenge Validation score of the submitted algorithms tested in the validation set of 6630 records. The rule-based method RB1 essentially did not use any learning process from the database LS_N1000 and the score was in agreement with the behavior of the first phase, whereas the second version (RB2) tried to extract some information from LS_N1000. For example, it tried to differentiate AF from AFL on the basis of the AF-waves’ frequency and amplitude, but the consequent improvement was not significant.
Different deep learning algorithms were submitted, with different learning subsets (LS_N1000, LS_N600, LS_N1500) and different numbers of iterations, but the scores (Table 6) were all in the range [0.400, 0.426], indicating that all these algorithms showed similar behavior. In particular, GoogLeNet_6 resumed the training from a previously saved pretrained network, which was obtained from three-fold cross-validation on LS_N1000 with 10 iterations.
Table 7 displays the cross-validation indices obtained by training and testing on the learning databases LS_N1000 and LS_N1500. It is interesting to note that the reported indices F_2, G_2 and the normalized score are in agreement with the official results, although somewhat more optimistic, probably because of the composition of the unknown test set.
The final official results were announced considering the test set of 16,630 ECG records. Our team, named ‘Gio_Ivo’, submitted the deep learning method GoogLeNet_6 and achieved a Challenge validation score of 0.426 and a full test score of 0.298, placing 12th out of 41 in the official ranking. In particular, Table 8 reports the official performance indices for the different hidden validation/test sets. The presence of a hidden undisclosed set (10,000 ECG records) from an American institution geographically distinct from the other datasets caused a significant decrease in the Challenge score. This is a critical point, showing the importance of the composition of the learning/testing sets.
Table 9 shows the AUROC, AUPRC and F_1 scores for the considered diagnostic classes. In this table, we can observe the weak points of the classifier. Three diagnostic classes had very low F_1 scores: Bradycardia (0.0), PR (0.05) and RAD (0.053). These were the three classes with the lowest numbers of examples (288, 340 and 427, respectively), and they also had correspondingly low AUPRC values (0.001, 0.019 and 0.025, respectively). These results confirm that class imbalance is a critical problem and show the limits of the random over-sampling technique.
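To make the limitation concrete, the following is a generic sketch of random over-sampling for multi-label records. It is not the authors' procedure, whose details are not given in this section: the function name, the target count and the toy records are all hypothetical.

```python
import random
from collections import Counter


def random_oversample(records, target: int, seed: int = 0):
    """Duplicate randomly chosen records until every class has at least `target` examples.

    `records` is a list of (record_id, set_of_labels) pairs.
    """
    rng = random.Random(seed)
    out = list(records)
    counts = Counter(label for _, labels in out for label in labels)
    for label in sorted(counts):
        # Candidates are drawn only from the original records carrying this label.
        pool = [rec for rec in records if label in rec[1]]
        while counts[label] < target:
            rec = rng.choice(pool)
            out.append(rec)
            counts.update(rec[1])  # a duplicate raises the count of all its labels
    return out


# Toy, hypothetical learning set with two rare classes (PR and RAD).
records = [
    ("r1", {"AF"}), ("r2", {"AF"}), ("r3", {"AF"}), ("r4", {"AF"}),
    ("r5", {"RAD"}), ("r6", {"PR", "AF"}),
]
balanced = random_oversample(records, target=4)
class_counts = Counter(label for _, labels in balanced for label in labels)
print(len(balanced), dict(class_counts))
```

The duplicated records carry no new information: the rare classes reach the target count, but the set of distinct examples seen by the network is unchanged, which is consistent with the low F_1 and AUPRC values observed for the smallest classes.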
 The results clearly show that the deep learning architecture that directly examines raw ECG data and time-frequency images is able to produce satisfactory results.
Various teams that participated in the PhysioNet/Computing in Cardiology Challenge considered the deep learning approach [27,28,29,30], showing a particular interest in this methodology. For example, the team with the highest score [27] considered both raw ECG data and ECG features extracted from ECG signals, including age and gender. A deep neural network with a modified residual neural network architecture was considered in [28], in which the scatter blocks processed the 12 leads separately. In [29], wavelet analysis and a convolutional network were used for each single lead, and a single output label was obtained, reducing the diagnostic categories to the individual and the most frequent combinations. In [30], the authors combined a rule-based model and a squeeze-and-excitation network.
Over recent years, there has been a rapid development of machine learning techniques, with a growing number of ECG classifiers [3,31]. These algorithms consider different sets of cardiac arrhythmias and small or relatively homogeneous datasets, reducing the possibility of a real comparison [11]. For example, in [31] the authors consider 12 classes, in [3] they consider six cardiac abnormalities, whereas the present work considers a set of 24 relevant diagnostic classes of clinical interest, making a direct comparison complex.
Some of the characteristics of the proposed methods can be outlined. The rule-based method (RBM) mimics the classification process of an expert physician, and it obtains the classification in a very short time. However, improving its accuracy and its mimicking property would require significant effort, for example through active tuning on the learning database with more modular rules and fuzzy thresholds. The deep learning method is characterized by the use of a linear architecture fed only with raw ECG data, in which all the leads are examined simultaneously, considering a multi-label classifier with a large number of diagnostic classes, with a positive behavior in the presence of a significant class imbalance. This method has the drawbacks of complexity and a long training time. The use of pre-trained CNNs has simplified the training process; however, more specific deep learning architectures could improve the classification accuracy.
4. Conclusions
In the present study, we have explored the potential of a classical rule-based method and a deep learning architecture for the automatic classification of ECG signals. The two methods were tested and validated in the framework of the PhysioNet/Computing in Cardiology Challenge 2020, in which six annotated databases of 43,101 ECG records were considered for the training set. The training and validation databases contained a set of 27 relevant diagnostic classes of clinical interest, which represents the complexity and difficulty of ECG interpretation. A particular scoring system was defined by the Challenge judges because not all misdiagnosed classifications are equally bad.
The results of the two different techniques showed that deep learning methods which directly examine raw ECG data and images are able to produce very satisfactory results. In addition, this technique, which is methodologically quite simple but computationally demanding, performs better than the classical rule-based system.
The reported results showed that our team was able to complete the Challenge steps with two different methods. The final official results of our team, obtained using the deep learning GoogLeNet_6 approach, were a Challenge validation score of 0.426 and a full test score of 0.298, placing our team 12th out of 41 in the official rankings. The PhysioNet/Computing in Cardiology Challenge 2020 has provided the opportunity for unbiased and comparable research for testing the complexity of 12-lead ECG classifiers with a large public training set, as well as undisclosed validation and test sets.
Among the topics open for future investigations are the development of class-imbalance analysis, multi-label datasets and unequal sample sizes, in addition to the combination of the two proposed methods.