Clinically Feasible and Accurate View Classification of Echocardiographic Images Using Deep Learning

A complete echocardiographic study requires several video clips recorded from different acquisition angles to capture the complex cardiac anatomy. However, these clips are not necessarily labeled in a database, so identifying the acquired view is the first step in analyzing an echocardiogram. There is currently no consensus on whether training samples containing mislabeled views can be used to build a clinically feasible prediction model of ejection fraction (EF). The aim of this study was to compare two types of input methods for view classification, and to test the accuracy of an EF prediction model trained on a database containing mislabeled images that were not checked by observers. We enrolled 340 patients with five standard views (long axis, short axis, 3-chamber, 4-chamber, and 2-chamber) and 10 images per cardiac cycle, yielding 17,000 labeled images for training a convolutional neural network (CNN) to classify views. All DICOM images were rigidly registered and rescaled to a reference image to standardize the size of the echocardiographic images. We employed 5-fold cross-validation to examine model performance and compared models trained on two types of data: averaged images and 10 selected images. Our best model (trained on 10 selected images) classified video views with 98.1% overall test accuracy in an independent cohort; that is, 1.9% of the images were misclassified. To determine whether this 98.1% accuracy is sufficient for building clinical prediction models from echocardiographic data, we tested an EF prediction model trained on data containing this 1.9% labeling error. The accuracy of the EF prediction model was maintained even when the training data contained 1.9% mislabeled images. The CNN algorithm can classify images into the five standard views in a clinical setting, and our results suggest that this approach provides a clinically feasible level of view classification accuracy for the analysis of echocardiographic data.


Details of the classification model used
A detailed structure of the deep neural network (DNN) model used in this study is shown in Fig. S1. The network comprises five convolutional layers with N, 2N, 2N, 2N, and N filters and 3 × 3 kernels, each followed by a 2 × 2 pooling layer. A series of two fully connected layers, with 512 and 5 units, forms the final stage. The activation function in all activation layers is the rectified linear unit (ReLU), except for the last layer, where the softmax function is employed. The number of filters in each layer is controlled by N, and the results with N = 64 are shown in the manuscript. This value was determined by a grid search over the range [16, 256] in the initial stage of the present work (see the next section).
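The layer arithmetic implied by this architecture can be sketched in plain Python. The helper below is hypothetical (it is not the authors' code), and the 128 × 128 grayscale input size is an assumption made only for illustration, since the input size is discussed separately in Fig. S4.

```python
# Hypothetical sketch of the layer arithmetic for the 5-layer CNN described
# above: filters N, 2N, 2N, 2N, N with 3x3 'same' kernels, each followed by
# 2x2 pooling, then fully connected layers of 512 and 5 units.
# ASSUMPTION: a 128x128 single-channel input (not stated in the text here).

def layer_summary(n=64, input_size=128, in_channels=1, n_classes=5):
    """Return (final spatial size, flattened size, total parameter count)."""
    filters = [n, 2 * n, 2 * n, 2 * n, n]
    size, channels, params = input_size, in_channels, 0
    for f in filters:
        params += 3 * 3 * channels * f + f  # 3x3 conv weights + biases
        channels = f
        size //= 2                          # 2x2 pooling halves each dimension
    flat = size * size * channels
    params += flat * 512 + 512              # first fully connected layer
    params += 512 * n_classes + n_classes   # softmax output layer
    return size, flat, params

print(layer_summary())  # with N = 64: final 4x4 maps, flattened size 1024
```

Under these assumptions, five pooling stages reduce a 128 × 128 input to 4 × 4 feature maps before the fully connected layers, which is where most of the model's parameters reside.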

Model optimization
We determined the appropriate number of layers and filters at the initial stage of the present study using a selected cohort. At this stage, the averaged image was employed as input. The cohort included 340 patients, divided into 204, 68, and 68 patients for training, validation, and testing, respectively. During training, the cross-entropy error function was employed as the loss function.
We used the Adam optimizer with default parameters except for an initial learning rate of 0.0005 and a batch size of 5. The number of epochs was set to 50, and the model weights were stored when the validation loss reached its minimum. The view classification accuracy on the test cohort was used for evaluation. The results are shown in Fig. S3, where a 3D bar graph of accuracy is plotted for various numbers of layers and filters; 5 layers with N = 64 filters yielded the best performance. Using this model, the dependency on the input image size was also checked (Fig. S4).
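The checkpointing rule described above (keep the weights at the minimum validation loss over 50 epochs) can be sketched as follows. This is a minimal illustration, not the authors' training code; the actual Adam update step is replaced by a placeholder comment.

```python
# Minimal sketch of checkpointing on minimum validation loss, assuming one
# validation-loss value per epoch (50 epochs in the study). The training
# step itself (Adam, lr = 0.0005, batch size 5) is only indicated by a comment.

def train_with_checkpoint(val_losses):
    """Return (best epoch index, best validation loss) over a training run."""
    best_epoch, best_loss = -1, float("inf")
    for epoch, loss in enumerate(val_losses):
        # ... run one training epoch here, then evaluate on the validation split
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
            # ... snapshot the model weights here (the stored checkpoint)
    return best_epoch, best_loss

print(train_with_checkpoint([0.9, 0.5, 0.7, 0.4, 0.6]))  # -> (3, 0.4)
```

The same behavior is what a Keras `ModelCheckpoint` callback with `save_best_only=True` provides in practice.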

Other models
We examined DNN models other than the model described in the manuscript: two-dimensional (2D) CNN models that use the RGB channels to hold three images. The framework is similar to that of Fig. S1, except that the input image has three channels. For this model, we selected 3 of the 10 sequential images: the image pair with the minimum cross correlation was selected first, and the third image was taken from the middle of that pair. The three selected images were assigned to the R, G, and B channels (hereafter, this model is called the "RGB model"). The training data were augmented by sliding the initial image of the cardiac cycle in each sequence. For comparison, we also examined the well-established models listed at https://keras.io/applications/. In this supplement, we show the result for "ResNet50", one of the representative ImageNet models.
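The 3-of-10 frame selection described above can be sketched with NumPy. This is an illustrative reconstruction, not the authors' code; in particular, the use of the Pearson correlation coefficient on flattened frames as the "cross correlation" measure is an assumption.

```python
import numpy as np

# Hypothetical sketch of the frame selection for the RGB model: find the pair
# of frames with the lowest cross correlation (the most dissimilar phases of
# the cycle), then take the frame midway between them as the third channel.
# ASSUMPTION: Pearson correlation on flattened frames is used as the measure.

def select_rgb_frames(frames):
    """frames: list of 2-D arrays. Returns indices (i, mid, j) with i < j."""
    n = len(frames)
    flat = [f.ravel().astype(float) for f in frames]
    best = (None, None, np.inf)
    for i in range(n):
        for j in range(i + 1, n):
            corr = np.corrcoef(flat[i], flat[j])[0, 1]
            if corr < best[2]:
                best = (i, j, corr)
    i, j, _ = best
    return i, (i + j) // 2, j  # these three frames fill the R, G, B channels
```

For example, on 10 frames that vary smoothly over a cardiac cycle, the least-correlated pair is typically the first and last frames, so the selection would be the first, middle, and last frames of the sequence.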
Fig. S5 shows the view classification accuracy for the test cohort (68 cases). In this comparison, the cohort included 340 patients, divided into 204, 68, and 68 patients for training, validation, and testing. The RGB model and the ResNet50 model give results comparable to those of the 10-image model.