A Convolutional Neural Network Combining Discriminative Dictionary Learning and Sequence Tracking for Left Ventricular Detection

Cardiac MRI left ventricular (LV) detection is frequently employed to assist cardiac registration or segmentation in computer-aided diagnosis of heart diseases. Focusing on the challenging problems in LV detection, such as the large span and varying size of LV areas in MRI, as well as the heterogeneous myocardial and blood pool parts in LV areas, a convolutional neural network (CNN) detection method combining discriminative dictionary learning and sequence tracking is proposed in this paper. To efficiently represent the different sub-objects in LV area, the method deploys discriminant dictionary to classify the superpixel oversegmented regions, then the target LV region is constructed by label merging and multi-scale adaptive anchors are generated in the target region for handling the varying sizes. Combining with non-differential anchors in regional proposal network, the left ventricle object is localized by the CNN based regression and classification strategy. In order to solve the problem of slow classification speed of discriminative dictionary, a fast generation module of left ventricular scale adaptive anchors based on sequence tracking is also proposed on the same individual. The method and its variants were tested on the heart atlas data set. Experimental results verified the effectiveness of the proposed method and according to some evaluation indicators, it obtained 92.95% in AP50 metric and it was the most competitive result compared to typical related methods. The combination of discriminative dictionary learning and scale adaptive anchor improves adaptability of the proposed algorithm to the varying left ventricular areas. This study would be beneficial in some cardiac image processing such as region-of-interest cropping and left ventricle volume measurement.


Introduction
Anatomical or tissue object detection has receiving increasing attention in medical image analysis field since it facilitates to isolate the area of interest from the background and determine the relevant position and category in medical images [1][2][3]. In the image-based assisted diagnosis process of cardiovascular diseases, it is an important preparation stage in the detection of left ventricle and myocardium in cardiovascular magnetic resonance imaging (MRI) slices. Accurate automatic left ventricular localization will provide great help for subsequent registration, segmentation, automatic ejection fraction calculation, and other tasks, so as to assist doctors in clinical diagnosis of heart health.
So far, a limited number of automatic cardiac object detection methods have been proposed [4,5] and some deep learning based methods have also been investigated [6,7]. However, it is a very challenging problem because of the complex distribution of various organs and tissues in cardiac MRI, intensity inhomogeneity, insignificant intensity differ-ences between the myocardium and surrounding tissues, and large changes in the shape and size of the left ventricle region at different slice locations.
In this paper, we propose a convolutional neural network (CNN) method specifically designed to detect the left ventricle object in cardiac MRI slices. Our approach takes the detection into coarse oversegmentation location, discriminative dictinary learning-based location adjustment, and fine CNN-based detection parts with interpretability. The unsupervised SLIC (simple linear iterative cluster) algorithm is introduced to obtain initial oversegmentation because it has compact structure, good edge fitness, low complexity, and flexible adjustment. In addition, it does not require a preliminary training process in comparison to the CNN-based region proposal subnetwork. The discriminative dictionary learning can efficiently represent the different sub-objects in left ventricle area, and the sequence tracking fast generates left ventricular scale adaptive anchors on the same individual to solve the problem of slow classification speed of discriminative dictionary. Once multi-scale adaptive anchors are generated in the target region for handling the varying sizes, they are trained in regional proposal network, and the left ventricle object is localized by the CNN based regression and classification strategy. In the following sections we will present related works, our motivation, the details of our method, and experimental validation.

Related Works
Traditionally, the object detection methods in the computer vision field can be directly applied for left ventricular detection. The representative method was originally developed for face detection, which introduces a sliding window to traverse the entire image, designs single or multiple handcrafted features (scale invariant feature transform-SIFT [8], histograms of oriented gradients-HOG [9], etc.) to characterize the image area and builds traditional classification algorithms (support vector machines-SVM [10], AdaBoost [11], etc.) to classify the region. The semi-automatic approach [12] can even significantly reduce clinicians' the amount of work. One obvious shortcoming in these methods arises from the redundant regional anchors since sliding windows are generated in an untargeted manner and another comes from the handcrafted features that have weak ability to represent the image region, which seriously affects the detection speed and performance.
To overcome the above limitations, many CNN (Convolutional Neural Network) based detection methods have been investigated and they achieved more satisfactory performance in nature images, these methods can be mainly divided into one-phase based, two-phase based, and anchor-free based methods.
One-stage object detection methods: This category aims to directly classify and regress the sliding windows, such as Yolo (you only look once) series [13][14][15], SSD (single shot multibox detector) [16], and their improvements. In the essential flowchart of this category, the input image is divided into multiple grids and each grid is responsible for predicting the target whose center falls within the grid and returns to the position and category of the bounding box directly at the output layer. Yolov3 is also used as automatic bounding box annotation in a recently proposed CPGGANs (Conditional Progressive Growing of Generative Adversarial Networks) to generate highly-rough bounding box conditions to place brain metastases at desired positions/sizes for MRI brain tumor detection [17]. It is characterized by fast detection speed but also the imbalance of positive and negative sample categories during training, resulting in poor robustness of targets with large morphological changes. To handle this, Retinanet [18] is proposed with a focal loss function to effectively alleviate category imbalances during training and achieved certain effects. On the other hand, Refinedet [19] is introduced to combine feature fusion and box regression from coarse to fine stages by following the SSD network structure, which improves certain accuracy and retains considerable detection speed.
Two-stage object detection methods: This category intends to generate proposals separately and then perform regression and classification. Representative algorithms include RCNN (regional CNN) [20], Fast RCNN [21], Faster RCNN [22], and their improvements. Although these algorithms are characterized by poor real-time performance, they have certain advantages in detection accuracy, and the intermediate process has strong interpretability. For example, the selective search technique [23] in RCNN generates region proposals with visible sense and is helpful for extracting CNN features one by one to feed classification and border regression. Fast RCNN introduces the idea of SPP-Net [24] to perform feature extraction on the entire image, achieving feature map sharing to improve the detection speed. Furthermore, Faster RCNN [22] uses RPN to generate region proposals and achieve end-to-end detection, which improves both detection speed and accuracy. In addition, R-FCN [25] removes all fully connected layers in Fast RCNN and proposes position sensitive score maps. Mask RCNN [26] presented a segmentation branch by using FCN (fully connected network) [27] for semantic segmentation, and it achieves the decoupling of mask and class prediction relationship where the introduced RoI Align instead of RoI pooling significantly improves the accuracy of the mask. The two-stage Mask-RCNN method is also proposed to detect and segment the optic nerve head and optic disc of retinal images [28], where the first stage aims to coarse positioning of the optic nerve head and the cropped images are used as the input of the second stage. The final result is largely dependent on the coarse positioning of the first stage. Recently, a cascade RCNN [29] is proposed to set different IOU (intersection over union) thresholds for positive and negative samples to improve the detection accuracy through multiple regressions on the bounding box. The cascaded localization regression network is also introduced for kidney localization [30], which detects the positions of kidney from two-dimensional cross-sectional slices in three orthogonal directions in one stage. The optimized training mechanism is also employed to improve the segmentation of left and right ventricles and myocardium on small sample data sets [31].
Anchor-free methods: This category is to directly predict the position of objects by inferring key point locations, such as Cornernet [32] and Centernet [33]. The idea of this category is to detect key points instead to key regions and has been applied in human posture estimation. To estimate a very large but extremely sparse bounding box dependent probability distribution, Denet [34] is proposed to extract a corner area of interest to filter most proposals by directed sparse sampling, and then employ it in a single endto-end CNN based detection model. To alleviate the anchor box optimization in single shot detection methods, FSAF (feature select anchor-free module) [35] is proposed to combine the anchor-based and anchor-free branches, which includes anchor-free branch and online feature selection; the former is used for each pixel classification and coordinate regression, and the latter is responsible for online detection of each ground truth feature map. In addition, synthesized training data by generating adversarial networks can expand the lack of data in the real image distribution, balance the number of real and synthetic training data, and achieve the better detection performance with additional workload [36]. The quantification results by deep learning in multicenter study [37] show that CNN can reach the same accuracy as the expert measurement with fast quantification.
There are less reports of traditional methods on the detection of left ventricle in cardiac MRI. Focusing more on the location of left ventricle [4], automatic segmentation of left ventricular myocardial area can be smoothly employed [38,39]. In recent years, several studies have been published on left ventricular detection in MRI based on deep learning. The long short-term memory (LSTM) was proposed to detect left ventricular sequences [5], but the cardiac MRI sequence is not as coherent as the video sequence; on the contrary, some frames have mutation, and the network structure did not include a regression module, resulting in inaccurate detection frames. The sliding window strategy is also adopted to generate fixed size and fixed position proposals, and then CNN is used for classification [40]. Due to the fixed size and position of the proposals, the overlap rate between the detection box and ground-truth one is low. The unsupervised stacked sparse auto-encoders (SSAE) depth features was integrated into candidate region generation and SVM classifier and regressor were trained [7]. Although SSAE can learn better features, it can not effectively fuse myocardial and blood pool regions with significant gray-scale differences, thus affecting the accuracy of detection box. Besides this limitation, the number of positive samples is insufficient and there is a lack of diversity. At the same time, an excessive amount of negative samples leads to an imbalance of positive and negative samples in network training, and the detection results are not satisfactory. Recently, a deep learning-based left ventricle detection and classification method was proposed [6]; in this method, the additional convolution filter layer was added to enhance the feature map along with the dropout layer, which intrinsically belongs to the classification of blood pool and myocardium instead of the left ventricle object detection.

Motivation and Contribution
The design of many general frameworks in target detection does not depend on image categories. Although they can be transferred to medical MRI after modification, it is worth noting that medical MRI can only be obtained through professional medical equipment, unlike natural images, which can easily obtain a large number of data. In many cases, the natural images are sent to deep CNN such as Resnet152, Resnet101, or Resnet50 [41] for achieving good results. Due to the relatively small number of medical MRIs, it is often difficult to fit deeper network, so the shallow backbones are usually chosen In computerassisted diagnosis and treatment, the accuracy counts more than the processing speed in most cases. At the same time, the interpretability and visualization of intermediate processes are important to assist the diagnosis of doctors. However, the one-stage detection algorithm and anchor-free algorithm often treat the network as a black box, and it is difficult to meet the practical application to obtain the detection results directly from the input images. Currently, the left ventricle detection is mainly based on the two-stage methods.
Based on the above considerations, this paper focuses on the two-stage convolutional network algorithms for left ventricle detection in cardiac MRI , where vgg-16 [42] is taken as the backbone.
Among the classical two-stage detection algorithms, Faster RCNN shows state-of-theart performance. However, it has certain limitations when it is directly applied to cardiac MRI left ventricle detection. Figure 1 reports some examples of error detection and the possible reasons are mainly due to the following three points: (1) The shape and size of the left ventricle vary greatly in cardiac MRI slices. As shown in Figure 2, the left ventricle target area has a large size span from 33 pixels to 14,134 pixels, and the size of the sliding anchors in the Faster RCNN is artificially set, the size of anchors is not significant and cannot be adapted to different left ventricular regions at different scales. (2) The capture of perception area of RPN in Faster RCNN is weak. As shown in Figure 2, 49.96% of the size is below 1766 pixels, 23.99% of the left ventricle area accounts for less than 1.88% of the entire image, and part is a small size area. while the perception area of RPN in the original image is 16 × 16 (unit: pixel). In each slice, the anchor center may be far away from the center of the left ventricular area, the low overlap between anchor and ground-truth results in the imbalance of positive, and negative sample types in RPN training. At the same time, due to the lack of high-quality negative samples around the threshold boundary of positive and negative samples in the proposed layer, the network has poor discrimination ability for the hard to distinguish negative samples.
(3) The distribution of organs and tissues in cardiac MRI is complicated, and some organs and left ventricle have little difference. RPN generates non-differential anchors in the original image, which leads to misdetection of similar regions of the left ventricle. At the same time, due to the lack of adaptability of the generated anchors to small-sized ventricular regions, missed detection is prone to occur.
In view of the above problems, this paper proposes a CNN left ventricular detection algorithm combining discriminant dictionary learning and sequence tracking. The algorithm applies a discriminant dictionary learning to classify the superpixel oversegmented regions, then merge the target region of left ventricle, and generate multiscale adaptive anchors in the target region. Combining with non-differential anchors in RPN, the left ventricle is detected by CNN regression and classification strategy. In order to solve the problem of slow classification speed of discriminative dictionary, a fast generation module of left ventricular scale adaptive anchors based on sequence tracking is proposed on the same individual. Specifically, this paper provides three contributions as follows.
(1) During the training phase, this paper classifies the segmented superpixel regions by discriminative dictionary, fuses the same labels to construct the left ventricular candidate region, generates scale adaptive anchors in the candidate region, and combines anchors in RPN training to increase the number of positive samples and alleviate the imbalance problem of positive and negative samples, randomly perturb the scale adaptive anchors, construct the diversity of positive samples, obtain some high-quality negative samples, add them into the proposal layer to enhance the network's discrimination of hard negative samples.
(2) During the detection phase, the adaptive anchors of the left ventricular region scale are mapped to RPN and proposal layers, respectively, to effectively prevent false detection and missed detection and enhance the robustness of the network to small-sized ventricular regions.
(3) Considering the slow classification speed of discriminative dictionary, this paper proposes a rapid generation module of scale adaptive anchors applicable to the left ventricle region in the same individual according to the idea in discriminative scale space tracking [43]. It can effectively improve the detection speed in the same individual.
The rest of this article is organized as follows. The second section presents the details of the proposed method, and the third section reports the experimental results and makes discussion. Finally, the fourth section concludes the study.

Overview
The proposed overall framework of left ventricular detection in cardiac MRI is shown in Figure 3. It mainly consists of three modules: (1) Left ventricle candidate region building module. This module adopts the unsupervised SLIC algorithm [44] to segment multiple MRI slices with left ventricular tags, clusters other regions outside the ventricle, and divides them into seven types of superpixel regions. The dictionary is trained from these superpixel regions and then applied in oversegmented testing images, where the superpixel regions are classified one by one, and a region fusion algorithm is performed by the same label to generate candidate regions of the left ventricle.
(2) Detection network module. This module follows the network structure of Faster RCNN, and uses vgg-16 to extract features. In the training phase, the left ventricular candidate region scale adaptive anchors are added to the RPN, and the diversity of positive samples is constructed by randomly perturbing the scale adaptive anchors, while generating some high-quality negative samples to be added to the proposal layer. During the test phase, anchors are mapped to the RPN and the proposal layer.
(3) Fast generation module of left ventricular scale adaptive anchors. According to the detection results of the current frame, all MRI slices of the same individual are stacked along the short axis. The module can quickly generate left ventricular scale adaptive anchors on each cardiac MRI along the frame, which significantly saves time cost and achieves high accuracy.
The details of the modules will be presented as follows.

Left Ventricular Candidate Region Generation Module
This module aims to automatically build the region candidates containing left ventricles based on oversegmentation and region merging techniques. As shown in Figure 4, the module is mainly divided into a training stage and a testing stage as below. Figure 4. Block diagram of candidate region generation module. In the training stage, the seven sub-dictionaries are trained by seven types superpixel regions. In the testing stage, the discriminative dictionary is proposed to classify the segmented superpixel regions and fuse the same label areas to generate left ventricular candidate regions.

Initial Oversegmentation
The initial oversegmentation is to generate a pile of superpixel candidates fast in an explainable way by borrowing the merits of the structural tightness and edge fitness of the cardiac images.
Due to the imperfection of magnetic resonance equipment, cardiac MRI is prone to artifacts and noise. In the preprocessing stage, the SLIC superpixel segmentation algorithm utilizes the redundant information between pixels and the effects of artifacts and noise can be eliminated through feature similarity. We use unsupervised SLIC oversegmentation to obtain initial superpixels because it has compact structure, good edge fitness, and flexible adjustment. In addition, the algorithm has low complexity and does not require a preliminary training process. Nevertheless, it is necessary to fuse them since the structural organization is significantly different and the superpixels do not cover the true left ventricle region exactly. Furthermore, according to the results of K-means clustering, the superpixel regions can be coarsely divided into six types related to left ventricle regions and one to background region. In this way, these seven types of regions will be further classified to their desired labels by learning strategies.

Discriminative Dictionary Training
The oversegmented regions usually differ from the true regions due to the large gray discrepancy between the myocardium and the blood pool as well as the intensity inhomogeneity in the surrounding tissues. To handle this, we propose a discriminative dictionary learning method to fuse the myocardium and blood pool into the same region and then improve the accuracy of scale adaptive anchors generation which mainly depends on the results of these oversegmented regions. Traditional dictionary learning method is not suitable for this problem since it lacks discriminative power for each type of data. If the dictionary specific to each kind of data is learned, the sub-dictionaries are able to represent the data with same properties (i.e., blood pools or myocardium), while their representation ability for other kinds is poor. However, the nearby superpixels in the sub-dictionaries are usually overlapped to represent the partial object and when the left ventricle label is incorporated to enhance the class information in advance, the myocardium and blood pool will be taken into a same type for training a certain type of sub-dictionary in the dictionary, so it becomes possible to put both into one category in testing stage.
The discriminative dictionary learning can explicitly constrain the type of atoms in the dictionary to build a discriminative sub-dictionary [45], so we use it to increase the model's ability by learning a dictionary with strong reconstruction and discriminating capabilities. Specifically, the objective of our model is where I(Y, D, X) is the reconstruction term that aims to increases the dictionary's ability to identify each type of data during the reconstruction; f (X) is the function of sparse coding that can enhance the discrimination of sparse coding for each type of superpixel region. 1 and 2 are constants to adjust the weight of two parts. Furthermore, the definition of discriminative dictionary model is defined as which corresponds to a certain type of superpixel region, denoting the superpixel region, i is a sparse encoding y i on D, and X i i is the sparse encoding of the i-th sub-dictionary of X i . There are three constraint terms on the right side of the equation, where ||y i − DX i || 2 F is the reconstruction error. In the identification, it is necessary to ensure that the reconstruction error is sufficiently small. ||y i − D i X i i || 2 F is the discrimination term that tends to minimize the error between y i and a sub-dictionary D i and to identify the class most similar to the dictionary. ∑ k j=1,j =i ||D j X j i || 2 F is the constraint term of the remaining sub-dictionaries, which can make the F-norm of the remaining sub-dictionaries as small as possible.
The discriminant sparse coding term f is defined as where S w (X) and S B (X) are the intra-class scatter and the inter-class scatter matrices, respectively, i.e., T . m and m i are the mean of total sample and samples in the i-th class, respectively. n i is the number of training samples in the i-th class. By making the intra-class error as small as possible and the inter-class error as large as possible in the sparse coding, f (X) can enhance the ability of model discrimination.
So Equation (1) can be transformed as follows, The solution of the model is divided into two parts. D(D = [D 1 , D 2 , · · · D c ]) is firstly fixed and X(X = [X 1 , X 2 , · · · X c ]) is updated, then D i and X i are updated classwisely during the solution. The weighting parameters 1 and 2 are settled as 0.1 and 0.001 respectively. Specifically, when D is fixed, X can be updated by using the IMP algorithm [46], that is, It should be noted the remaining X j (j = i) are all fixed when X i is calculated. The above formula can be simplified as where m and m k are the means of all classes and the k-th class of sparse coding, respectively. When X is fixed, then D i is updated class by class as follows, where X i is the sparse encoding of the entire training data set Y on D i . When D i is calculated, the remaining D j (j = i) are all fixed, and they are updated in the following steps.
(1) Initialize dictionary D. In each type of sub-dictionary D i ( i = 1, 2 · · · c), randomly select n atoms as c × n initialization atoms of D, and there are totally n × c atoms.
(3) Fix X, update dictionary D, and calculate it by using Equation (7).
(4) Return to step 2 and iteratively update until the values of D and X in two adjacent iterations are close enough, or until the maximum number of iterations, then output X and D.

Dictionary Testing
In the testing stage, a single class of sub-dictionary is used to reconstruct a certain superpixel region y and excludes the influence of the remaining sub-dictionaries. Let β bd a sparse encoding of y on dictionary D and it is solved as follows where η 1 and η 2 are constants. Furthermore, the label is estimated by combining reconstruction error and sparse coding error as follows, where the smallest L i corresponds to the classified superpixel region. The weighting parameters η 1 and η 2 are settled as 0.1 and 0.005 respectively. Once the minimum L i is obtained, the superpixel region is assigned a label corresponding to the sub-dictionary D i .

Superpixel Region Fusion
Since initial superpixel region of each image is able to be labeled according to the discriminative dictionary learning, the objective of fusion strategy is to merge the regions with the same label. In the ideal case, the left ventricular region can be completed distinguished by discriminating the superpixel region of the ventricle through a discriminative dictionary, but in practice, due to the complex distribution of various tissues and organs in cardiac MRI images, as well as the subtle differences between some tissues and organs and the left ventricle, some organs that scatter in different positions in the image are easily misjudged as left ventricle, so in the following the detection network is described for accurate classification and localization. Figure 5 shows some superpixel region fusion examples, from where it can be seen that in comparison to the results of initial SLIC oversegmentation, the left ventricular regions are more remarkable after the processing of proposed discriminative dictionary learning and label fusion. However, the results of regional fusion perform still not very well in some highly complicated images. So it is necessary to compensate them and correct the partially misjudged regions by designing the detection network with the scale adaptive anchors.

Detection Network and Setting of Scale Adaptive Anchors
The detection network in this paper adopts the regional proposal network (RPN) that is modified with scale adaptive anchors. In order to match the number of anchors slipped out of each point of the RPN, the scale adaptive anchors generated correspond to 9 anchors for each point. In the training phase, the Euclidean distances are computed between the center of the label and all the centers in the training images that are identified as left ventricular candidate regions, then the smallest distance is regarded as a significant left ventricular candidate region. In short-axis cardiac MRI images, the left ventricle region is close to a circle, so the precise bounding box should be close to a square. The minimum bounding box is generated in the salient left ventricle region in each training image, which is called the saliency anchor. If this saliency anchor is rectangular, a long and short side of the saliency anchor are used to generate two square enclosing boxes in the same center, which are three basic enclosing boxes. If the saliency anchor is square, the diagonal length and semi-diagonal length of the saliency anchor are used to generate two square enclosing boxes in the same center, which is the three basic bounding boxes. The length and width of the 3 basic bounding boxes are multiplied by the scale scaling factor s 1 and s 2 to generate 6 bounding boxes. The centers of the 9 scale adaptive anchors are mapped onto the feature map to replace the candidate anchors corresponding to the current point in the RPN. This setting approach will make the anchors have a good IOU value even if the regional fusion may produce some errors.
In order to further alleviate the shortage of positive samples, the sample augmentation was taken, where nine scale adaptive anchors' centers were generated by randomly disturbance in the significance anchor and replacing the candidate boxes corresponding to the points around the current RPN points. In this way, the positive samples with diversity are constructed, and the high-quality negative samples around the threshold boundary are also built to join the proposal layer, so as to enhance the network's ability to identify the hard to distinguish negative samples. In the testing stage, the scale adaptive anchors are mapped to the RPN and proposal layer on all left ventricular candidate regions for detection.

Left Ventricular Region Scale Adaptive Anchors Rapid Generation Module
The cardiac MRI slices contain a series of images of multiple individuals at different times. To speed up the detection, this paper proposes a rapid generation module of adaptive anchors suitable for left ventricular regional scale for the same individual. The overall framework is shown in Figure 6. Due to the complex tissue distribution and low discrimination of various organs in medical MRI images, it is easy to make the estimated center position significantly different from the true center if the dictionary learning algorithm is directly applied. The scale adaptive anchors in this paper are generated in the same center; the accurate results are hard to obtain if the center position is not accurate. To handle this, the detection boxes are divided into 9 small blocks, and the estimated center positions of these blocks are averaged to obtain the prediction center. The final central location is computed as the weighting location of the prediction center and the prior center respectively. At the new center position, the left ventricular saliency anchor is obtained through a multiscale filter template, and the scale anchors are set on the saliency anchor in the same manner. Subsequent detection does not need to generate significant anchors through discriminative dictionary classification and label fusion, which effectively improves the detection speed and achieves competitive results. The training of template constructs the optimal correlation filter h by establishing a minimized cost function as follows where g is Gaussian response and l represents a certain dimension of the feature. λ is a regular term coefficient for eliminating the influence of the zero frequency component in the f spectrum. It is empirically set as 0.01. This filter can be solved by a transformed into the frequency domain as follows, where G is the conjugate of the two-dimensional Gaussian response to the two-dimensional Fourier spectrum; F k is the two-dimensional Fourier spectrum of the k-th dimension of the current frame, F k is the conjugate of the two-dimensional Fourier spectrum of the k-th dimension feature of the current frame, H is split into numerator A and denominator B, and it is iteratively updated as follows where the learning rate θ is set as 0.025 and the target positions of 9 small blocks can be obtained by solving the maximum correlation filter response value as follows In the above formula, Z l is the two-dimensional Fourier spectrum of the k-th dimension feature of the new frame. Once the new center point is estimated, candidate patches of different scales are obtained at the new center, and the best matching scale is obtained through the one-dimensional scale filtering template. The method of obtaining is similar to the position estimation, the choice of scale is shown as a n w × a n h; n ∈ − s−1 2 , · · · s−1 2 , where w and h are the width and height of the previous frame, respectively, and s = 33 is the number of scales.
The tracking of each individual is mainly divided into two steps: tracking of location and tracking of changes in scale. The flowchart of the main tracking algorithm is as follows: (i) Location estimation.
(1) According to the detection position of the current frame, the detection region is averagely divided into 9 regions Z trans for training 9 A trans t and B trans t .
(2) At the next frame, using 9 Z trans and 9 small blocks of A trans t and B trans t to calculate y trans respectively.
(3) Calculating max(y 1 trans ) · · · max(y 9 trans ) and averaging the center positions of the 9 small blocks to obtain the predicted center position P 1 . The new center position P t of the target is obtained by weighting and averaging P 1 and the prior center P 2 with weight coefficients w 1 and w 2 .
(ii) Scale estimation. Using the current frame information, the position template and the scale template are updated by the above update equations, where A trans t = GF l and B trans t = ∑ d k=1 F k F k + λ are used to generate filtering templates in the process, also Z trans is the Fourier spectrum.

Cardiac Mri Data Set
The experimental data are from the cardiac atlas project (CAP) database. This data set was created by a collaborative database of multiple organizations [47], among which MRI acquisition equipment includes Siemens (Avanto 1.5T, Espree 1.5T, Symphony 1.5T), Philips (Achieva 1.5T, 3.0T, Intera 1.5T) and GE (Signa 1.5T), etc., due to different acquisition equipment, the image sequence is very different, the image size is 192 × 156 ∼ 512 × 512, and the voxel resolution is 0.7 × 0.7 × 6 ∼ 2.0 × 2.0 × 2.0 (unit : mm 3 ). The data set contains a total of 83 patients with short-axis cardiac MRI. In one cardiac cycle, it contains about 18 to 35 volumes according to different time sampling intervals, and each volume contains about 8 to 17 images according to different space sampling interval along the short axis. Due to the imperfection of magnetic resonance equipment and the specificity of the subject, the MRI images inevitably present a certain degree of uneven brightness and intensity inhomogeneity. In our experiments, the bias field correction was performed on the image in the preprocessing stage, and the images with incomplete labels in the data set were eliminated. Totally, the experimental data set consists of 19,448 cardiac MRI images.
The myocardium region in each image was manually labeled by two well-trained students. Since this labeling was a non-trivial and challenging problem due to the fuzzy organ boundaries and large anatomical variability, their results were carefully cross-checked and further checked by a radiologist to made a final result as gold standard for evaluation. For each subject, this processing took about 2 h.

Evaluation Metrics
The detection performance of the model is comprehensively evaluated by the following measurements. Considering accuracy and recall measures usually received special attention in the medical image processing field, in our experiment the precision (P), recall (R), F1 score, AP50 and AP75 are commonly taken as evaluation indexes in our object detection tasks. In order to reflect the adaptability of the model to different individuals, 10 different individuals were randomly selected from the test set and evaluated using the AP50, AP75, and mAP indicators. In order to ensure the uniformity of the evaluation results, the threshold score of 0.7 in Faster RCNN is used in the experiment. The prediction confidence score is greater than 0.7 as the correct detection, otherwise as the false detection.
When calculating P and R, the indicators are defined as follows. The number of samples in which the left ventricular region is correctly predicted as the left ventricular region is TP (true positive). The number of samples in which the background region was incorrectly predicted as the left ventricular region was FP (false positive). The number of samples in which the background region is correctly predicted as the background region is TN (true negative). The number of samples in which the left ventricular region was incorrectly predicted as the background is FN (false negative). Based on them, accuracy represents the ratio of the number of positive samples that are correctly predicted to the ventricular area to the number of all divided positive samples, i.e., The recall rate is the ratio of the number of positive samples that are correctly predicted in the left ventricle area to the number of all true positive samples. i.e., F1 is the harmonic mean of the P and R measures. It can be used to measure the similarity of the two sets, i.e., The experiments were carried out on 2.0 GHz Intel CPU, 48GB RAM, NVIDIA RTX 2080Ti, Linux 64 bit PC. the programming environment was Anaconda 5.0.1 (Python 3.6), tensorflow1.4, keras2.0.8.

Left Ventricle Candidate Region Building Module
For the dictionary training data set, 10 different individual legends are randomly selected from the total training set, and a complete slice sequence is randomly selected from the 10 different legends to build the discriminatice dictionary. In the process of experiment, the weight coefficient follows the original model. Before training of discriminative dictionary learning, PCA is used to reduce the dimensionality of various training data and retain core atoms. The initial segmentation number in the experiment is set to M. If M is too large, the fitting degree of the edges of various tissues and organs will not be accurate, and the resulting IOU after the merging processing will be low and vice versa. In order to get optimal parameters, the relationship between the IOU after merging and the initial number of segmentation was experimentally investigated and the results are shown in Figure 7, from where it is seen that the IOU achieves the highest value when M is 400. Therefore, the number of initial superpixel is finally determined as 400.

Detection Network Module
In the network parameter setting, the data augment approaches such as horizontal flip-up extended data set were used. The optimization approach was taken as the stochastic gradient descent method and the learning rate was set to 0.001 and the training batch was 256 anchors. According to the candidate box IOU as a positive and negative sample partitioning criterion, the RPN settled IOU > 0.7 as a positive sample, IOU < 0.3 as a negative sample, and the NMS [48] threshold was set to 0.7. In the proposal layer, IOU > 0.5 was set as a positive sample, and 0.1 < IOU < 0.5 was set as a negative sample. In order to adapt to the cardiac MRI data set, the size of the anchor was adjusted as 64 × 64, 128 × 128 and 256 × 256. The number of iterations was set to 50,000 and the weight decay rate was 0.0005.
The training loss curves including regression part and classification part are shown in Figure 8, from where it can be seen that, with the increase of the number of iterations, each loss value gradually decreases, indicating that the model gradually adapts to the transformation of candidate regions and performs effective classification and regression adjustment on candidate anchors. However, there are some jumps in the curve. The main reason for the jumps is that there are larger errors between the significant left ventricular region and the real region in small parts of MRI; in addition, there are also some impacts rising from different image parameters collected by different devices.

Saliency Anchors Fast Generation Module
In the left ventricular scale adaptive anchors rapid generation module in the experiments, λ was set to 0.01; the learning rate was set to 0.025; the scale factor was set to 1.02, and the number of scales was set to 33. In our experiment 20 slice sequences were randomly selected in the training set, and their label boxes were used as ground truth to explore the relationship between the prior center weight coefficient and the average IOU. The experimental results are shown in Figure 9a. It can be seen from the figure that the curve has small fluctuations and the average IOU is above 0.6, when the weight coefficients w 1 = 0.2 and w 2 = 0.8, the maximum average IOU can be obtained, so they are selected as the optimal parameters.

Scale Adaptive Anchor Scaling Factor
The scale scaling factor in our model is a key parameter in generating scale adaptive anchors; if it is not set properly, the anchors will be difficult to fit different scale ventricular regions. The relationship between the average IOU and the scaling factor was explored by randomly selecting 5000 images in the training data set and then calculating the average IOU of the scale adaptive anchors under different scale factors. The results are shown in Figure 9b, where the IOUs of the two curves first increase and then decrease along with the increasing scale factor, and both reach their maximum values near the original basic frame (that is, the scale factor is 1). In sequence tracking generating scale adaptive anchors, the two scaling factors with the largest average IOU values are 1.1 and 1.2 respectively; in discriminant dictionary generating scale adaptive anchors, the two scale scaling factors with the largest average IOU values are respectively 0.8 and 0.9. So when the scale adaptive anchors fast generation module generates scale adaptive anchors, the scale scaling factor s 1 was set as 1.1 and s 2 was set as 1.2. Furthermore, when the discriminant dictionary generates scale adaptive anchors, the scale scaling factor s 1 was set as 1.1 and s 2 was set as 1.2.

Experimental Results
To thoroughly investigate the performance of the proposed method, three variant methods were built as follows. (1) Discriminant dictionary+Faster method. This variant one focuses on the effect of discriminant dictionary module, by combining it to the Faster RCNN; (2) Correlation filter + Faster method. This variant one focuses on the correlation tracking module; (3) Correlation filter + Proposal method. This variant one combines the left ventricular scale adaptive anchors generation module and the proposal layer to improve the detection speed. Furthermore, to more comprehensively evaluate the proposed algorithm, the state-of-the-art two-stage methods such as Faster RCNN [22], one-stage detection algorithms such as representative Retinanet [18], Yolov3 [14], SSD [16], and anchor-free method such as Centernet [33] are also introduced for comparison. The results of these comparative methods are shown in Table 1. In order to explore the adaptability of the detection algorithm to different individuals, the model was tested on 10 different individuals randomly selected in the test set. Using the ground-truth images as the standard, the test results were evaluated using AP50, AP75, and mAP. The comparison results are reported in Table 2. As can be seen from the above table, the model has different adaptability to different individuals. The proposed method and its variant can achieve significant performance improvements based on Faster RCNN on most individuals. In the evaluation of the mAP50 index, the method Correlation filter + Faster in this paper achieved competitive results in all comparison methods, which is 4.1% higher than Faster RCNN. In mAP75, the method and its variants outperforms 2.2∼3.6% improvement in comparison to Faster RCNN. but less than RetinaNet. Figure 10 plots the P-R curve of the proposed method and a series of classic detection algorithms on the testing set, as well as the recall-IOU curve. It can be seen from Figure 10a that the proposed method and variants in this paper are better than Faster RCNN. After the detection algorithm reaches a certain recall rate, the recall index will not change, and the detection accuracy will gradually decrease. The reason is that below a certain point, no left ventricular region has been recalled, and some error regions have been recalled. In the comparison of a series of classic detection algorithms, the method Correlation filter + Faster in this paper achieved competitive results. As can be seen from the above Figure 10b, the method and variants of this paper have better recall index than Faster RCNN. By adding scale adaptive anchors, it effectively prevents the missed detection of small scale left ventricular areas in Faster RCNN. Under the condition of IOU < 0.7 In this paper, the method Correlation filter + Faster will gradually surpass Retinanet and achieve competitive results on the recall rate index. In order to compare the P-R curves of each method, the enlarged parts of the curves are plotted in the right column. It can be seen from the zoomed parts and the larger area under the PR curves that when IOU < 0.7, the proposed Correlation filter + Faster method achieves better performance than the compared methods.
As can be seen from Figure 11, the method in this paper has achieved competitive results in comparison with the detection results of a series of methods, which can effectively avoid missed and false detections in Faster RCNN, Because Faster RCNN generates undifferentiated anchors on the entire map, when areas with similar shapes and appearances to the left ventricle are more likely to be misdetected, the anchors generated at the same time have poor robustness to small-scale regions, which leads to easy missed detection in images with smaller left ventricular regions. The left ventricular region scale adaptive anchors generated by the proposed method have adaptability to left ventricular regions of different scales, enhancing the significance of candidate anchors around the left ventricle region and improving the network's ability to identify difficult negative samples. In the detection stage, this paper explores whether the number of scale adaptive anchors is better. At the scale scaling factor of 0.4 to 1.5, two sets of scale adaptive anchors are added each time, and RPN and the proper layer are gradually added. After experiments were performed on the testing set, AP50 and AP75 were computed for evaluation by using the label box as the standard. The results are shown in Figure 12. As can be seen from the figure, the AP50 and AP75 indicators have no significant relationship with the number of groups of scale scaling factors, and the indicators change within a small range as the number of groups changes.

Overall Performance
The proposed method and variants of this article have presented competitive performance in the experiments. As can be seen from the tables, they have been improved to varying degrees on the basis of Faster RCNN. In the comparison of the AP50 and AP75 indicators, the variant method Correlation filter + Faster achieved the most advanced results on the AP50 indicator, which increased by 2.1% on the basis of Faster RCNN. The variant method Correlation filter + Proposal of this paper has achieved comparable detection results with retinanet on the AP75 indicator and has increased by 1.3% on the basis of Faster RCNN. In terms of the recall rate index, the method and variants of this article are better than Faster RCNN. Under the condition of IOU = 0.5, the Correlation filter + Faster method has achieved a competitive result, which is 2% higher than Faster RCNN. Under the condition of 0.75, the variation method Correlation filter + Proposal in this paper achieved a detection performance comparable to Retinanet, which improved 1.2% on the basis of Faster RCNN. The experimental results show that the left ventricular scale adaptive anchors proposed in this paper can prevent the left ventricle area from being missed to some extent due to the adaptability to the left ventricular area of different scales.
In terms of accuracy index, the method Discriminant dictionary + Faster and the variant method Correlation filter + Proposal both achieved certain improvement on Faster RCNN. The main reason is that after the left ventricular scale adaptive anchors are added to the non-differential anchors in the RPN, the candidate boxes around the left ventricle region are more significant, making this part of the candidate boxes more prominent in the network during training and testing and in MRI; although there are regions similar to the left ventricular region, the surrounding candidate anchors are not significant, so the occurrence of false detection is suppressed to a certain extent. At the same time, it should be noted that the Correlation filter + Faster is lower than the Faster RCNN in the accuracy index, and the performance at IOU = 0.75 is lower than IOU = 0.5. The possible reasons are conjectured as follows. In the training, the scale adaptive anchors generated by dictionary learning are used. While in the detection, the scale adaptive anchors generated by tracking are used. The scale changes of these two anchors during the regression process and the change of the moving distance are different, which leads to the low IOU of the detection boxes. Some of the detection boxes were not included in the correct number of detection due to the low IOU, so that certain accuracy was lost.

Time Cost
The proposed method was compared with a series of classic detection algorithms on the detection time index. During the experiment, the models were performed under the same experimental conditions as much as possible. The results are reported in Table 3. Table 3. Comparison of detection time between the proposed method in this paper and a series of detection algorithms (s/image), where D-d+Faster, C-f+Faster, C-f+Proposal denotes the proposed Discriminant dictionary + Faster, and Correlation filter + Faster, Correlation filter + Proposal methods, respectively. As can be seen from the table above, in the one-stage algorithm, the detection speed of Yolov3 has achieved a leading advantage. The method of this paper, Discriminant dictionary + Faster and Correlation filter + Faster, has a higher time cost than Faster RCNN. In particular, in the Discriminant dictionary + Faster method, each time a discriminative dictionary classifies a superpixel region, iteration is needed, resulting in high time cost. After using the left ventricular region scale anchors to quickly generate the module, the time cost is significantly saved, and the variant Correlation filter + Proposal method, because the previous RPN and other convolutional layers are removed, directly cascades the left ventricular-scale adaptive anchors module with the proposal, effectively increasing the speed and gaining 7% on Faster RCNN.

Limitation of Our Method
The detection performance of the proposed method is significantly improved in comparison to Faster RCNN, which can effectively prevent false detection and missed detection. The comparison with a series of detection methods validates the effectiveness of the improved method in this paper. However, due to the complex tissue and organ distribution in MRI, the shape characteristics of some left ventricle areas are not obvious, and some areas are highly similar to the left ventricle; it is inevitable that a small amount of missed or false detection will occur. Figure 13 reports some examples of false or missed detection by the proposed method. As can be seen from the figures above, the occurrence of pseudo-detection is mainly in areas where the appearance is similar to that of the left ventricle. Missing detection mainly occurs in the area where the the left ventricle shape features are not obvious and the size is too small. In further research, the adaptive and discriminant dictionary learning will be investigated for this challenging case.
It should be noted that this study treated 3D image volumes as 2D images for left ventricle detection in the training stage and testing stage. The 3D CNN was not applied to build the detect network mainly due to two facts. One was that the anisotropic property in the volumes makes 3D CNN hard to extract the feature along the short axis. Typically, the Z-interval is 6mm while the XOY spacing is 0.7mm in the data set. The other reason was based on the consideration of the training time of 3D CNN in comparison to 2D CNN.
Nevertheless, the structure of the proposed method cannot be called independent 2D CNN since a fast-scale adaptive anchors generation module is designed using correlation filtering which intrinsically belongs to a tracker. In each individual, the tracking of location and changes in scale is based on the observation that the positions of left ventricles in the images of the same individual are constrained in a similar location. In our future work, the proposed method will be extended to handle the structure of 3D CNN for detecting left ventricle and the tracking module will focus on the image volumes at different time points.

Conclusions
In this paper, in view of the large span of the left ventricle area in cardiac MRI, varying size of ventricular area, and the remarkable gray level and shape difference of the myocardial blood pool, a CNN detection method based on discriminative dictionary learning and sequence tracking is proposed for left ventricular localization. The left ventricle target region was fused by discriminating dictionary-classified superpixel region labels, and multi-scale adaptive anchors were generated in the target region. Combining with non-differential anchors in region proposal network, the left ventricle is detected by regression and classification strategy. To solve the problem of slow classification efficiency in discriminative dictionary learning, a sequence-tracking-based left ventricular scale adaptive anchors rapid generation module was proposed in the same individual. The proposed method and its variants were performed and compared to some classical algorithms in target detection on the public available heart atlas data set (CAP). The experimental results show the effectiveness of the proposed method in a series of evaluation indicators and that this method has the most competitive results on some evaluation indicators.
Considering the left ventricle in cardiac MRI is a three-dimensional tissue, our future work will focus on extending the proposed method to the three-dimensional detection of cardiac tissue and further estimate the posture of the heart in three-dimensional space.
Author Contributions: All authors contributed extensively to the study presented in this manuscript. X.W. and F.W. contributed significantly to the conceptualization and methodology of the study. F.W. designed the network and conducted the experiments. F.W. and Y.N. provided, marked, and analyzed the experimental results. F.W. and X.W. prepared and wrote the original draft. Y.N. prepared the writing-review, editing, and modification. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement:
The data presented in this study are publicly available. Refer to the description of each data set. reference papers. The authors also would like to thank the anonymous reviewers for their valuable comments, suggestions, and enlightenment.

Conflicts of Interest:
The authors have no conflict of interest in the paper.