MyI-Net: Fully Automatic Detection and Quantification of Myocardial Infarction from Cardiovascular MRI Images

Myocardial infarction (MI) occurs when an artery supplying blood to the heart is abruptly occluded. The “gold standard” method for imaging MI is cardiovascular magnetic resonance imaging (MRI) with intravenously administered gadolinium-based contrast (with damaged areas apparent as late gadolinium enhancement [LGE]). However, no “gold standard” fully automated method for the quantification of MI exists. In this work, we propose an end-to-end fully automatic system (MyI-Net) for the detection and quantification of MI in MRI images. It has the potential to reduce uncertainty due to technical variability across labs and the inherent problems of data and labels. Our system consists of four processing stages designed to maintain the flow of information across scales. First, features from raw MRI images are generated using feature extractors built on ResNet and MobileNet architectures. This is followed by atrous spatial pyramid pooling (ASPP) to produce spatial information at different scales to preserve more image context. High-level features from ASPP and initial low-level features are concatenated at the third stage and then passed to the fourth stage, where spatial information is recovered via up-sampling to produce the final image segmentation output with four classes: (i) background, (ii) heart muscle, (iii) blood and (iv) LGE areas. Our experiments show that the model named MI-ResNet50-AC provides the best global accuracy (97.38%), mean accuracy (86.01%), weighted intersection over union (IoU) of 96.47%, and bfscore of 64.46% for the global segmentation. However, in detecting only LGE tissue, a smaller model, MI-ResNet18-AC, exhibited higher accuracy (74.41%) than MI-ResNet50-AC (64.29%). New models were compared with state-of-the-art models and manual quantification. Our models demonstrated favorable performance in global segmentation and LGE detection relative to the state-of-the-art, including a four-fold better performance in matching LGE pixels to contours produced by clinicians.


Introduction
Myocardial infarction (MI), commonly referred to as a 'heart attack', occurs when an artery supplying blood to the heart is abruptly occluded. It is caused by the rupture of an atherosclerotic plaque in the wall of the artery, which triggers the clotting cascade and leads to vessel occlusion. This may result in severe damage to the heart muscle, which may be irreversible (scar). The extent of scarring following more severe heart attacks (ST-segment elevation MI, or STEMI) may drive enlargement of the heart and is associated with a worse prognosis (increased risk of death and subsequent heart failure) [1,2]. According to a 2020 report from the British Heart Foundation (BHF), MI accounts for approximately 100,000 hospital admissions annually. It is estimated that 1.4 million individuals alive in the UK today have survived an MI (1 million men and 380,000 women) [3].
Cardiovascular magnetic resonance imaging (MRI) provides accurate non-invasive diagnosis of MI. The late gadolinium enhancement (LGE) technique [4,5] uses a gadolinium-based contrast agent and specific magnetic resonance pulse sequences to provide a reproducible method for identifying and quantifying MI. LGE-CMR is recognized as the "gold standard" non-invasive method for visualizing and diagnosing MI, and also provides vital prognostic information following MI. A number of methods are available for the quantitative assessment of MI size from LGE images, including visual assessment, manual planimetry, and semi-quantitative methods (such as full width at half maximum [FWHM]) [6,7]. However, to date, there is no "gold standard" fully automatic method for MI detection and quantification.
In the past decades, several groups of researchers have worked to develop either semi-automatic or fully automatic methods for the detection and quantification of MI from MRI scans. For example, Eitel, et al. [8] proposed a standard deviation (SD) method for quantifying the extent of the salvaged myocardium area after reperfusion. Amado, et al. [9] used the FWHM criterion to confirm that MI can be sized accurately up to 30 minutes after contrast administration. Flett, et al. [10] compared seven quantification methods: manual quantification, thresholds of 2, 3, 4, 5, or 6 SDs above remote myocardium, and the FWHM method. They confirmed that the FWHM method provides the closest result to manual quantification and has the highest reproducibility. Hsu, et al. [11] measured the MI size in 11 dogs using automated feature analysis and combined thresholding (FACT). Comparison of the FACT algorithm with FWHM, intensity thresholding and manual contouring confirmed that human contouring may overestimate MI size and that FACT achieves higher accuracy than intensity thresholding. Tong, et al. [12] proposed the recurrent interleaved attention network (RIANet) for cardiac MRI segmentation based on ACDC 2017. Shan, et al. [13] proposed a segmentation method based on spatiotemporal generative adversarial learning without contrast agents. Xu, et al. [14] proposed a long short-term memory recurrent neural network (LSTM-RNN) for MI detection without contrast agents. Héloïse Bleton [15] proposed locating left ventricular infarcts using Neighbourhood Approximation Forests (NAF) and compared the approach with a stacked autoencoder method on 4D cardiac sequences. Fahmy, et al. [16] developed a UNet DCNN model for automatic cardiac MI quantification with stratified random sampling. The review by Bernard, et al. [17] reported that, for the ACDC 2017 challenge of cardiac MRI assessment, many researchers proposed using UNet [18] for the segmentation of the myocardium, right ventricle and left ventricle. Fahmy, et al. [19] also proposed using the UNet method for MI segmentation based on data collected from patients with and without MI.
Although significant progress has already been made in assisting clinical experts to quantify the size of MI in affected patients, major hurdles remain in this vitally important area. For example, manual tracing of contours is subjective and prone to low reproducibility, with high intra- and inter-observer variability, and is labor-intensive, with associated costs. Existing semi-automatic methods to localize MI are affected by biases introduced through tracing of the left ventricle (LV). All these challenges introduce significant uncertainty when detecting and quantifying the scar from cardiovascular MRI images.
In this paper, we propose a new system, named MyI-Net, to achieve end-to-end, fully automatic MI detection and quantification and so overcome the above challenges. To optimize performance, we propose a new class of appropriately engineered deep learning models. These models combine initial feature extraction (realized through ResNet- and MobileNet-based models) with Atrous Spatial Pyramid Pooling (ASPP), which adjusts the receptive field to preserve more image context. New feature maps are generated by fusing high-level features from ASPP with low-level features from one specific layer of the corresponding network. Finally, the segmentation result is obtained via up-sampling, performed by an add-on module, to recover the spatial information.
To deal with the other source of uncertainty, namely inherently unbalanced datasets (the number of pixels corresponding to scarred tissue in an image is always considerably smaller than the number of pixels corresponding to muscle, background, or blood pool), while still using all the data, we use an appropriately constructed weight matrix. As training datasets are always limited, and in order to increase robustness, we integrate three different augmentation methods into this model to diversify the training dataset. The new models, as well as the state-of-the-art counterparts used as baseline comparisons, were trained and validated on 1822 unique MRI images collected in our lab from research patients with MI.
The rest of this paper is organized as follows. Section 2 provides a detailed account of the materials, the data collection procedures and the demographics of the data. Section 3 presents our proposed methods, including construction of the weight matrix, details of data augmentation and the performance metrics. Section 4 presents the results of our experiments, including time cost analysis, segmentation analysis and comparison with state-of-the-art work as well as with manual segmentation produced by a human expert. Section 5 concludes by summarizing our proposed method, its limitations and future research.

Materials
The data were collected using cardiovascular magnetic resonance (CMR) imaging. With gadolinium-based contrast agents and appropriate pulse sequences, CMR can provide clear differentiation between infarcted and normal myocardium. To obtain LGE images, the patient is typically scanned 10-20 minutes after the intravenous administration of a standardized, weight-adjusted dose of gadolinium-based contrast agent.
The data came from a variety of MRI scanners. For data collected from Siemens 1.5T scanners, the sequence parameters were as follows: slice thickness 10 mm, repetition time 900 ms, echo time 4.91 ms, flip angle 30° and acquisition matrix 256/154. For Philips 1.5T scanners: slice thickness 10 mm, repetition time 4.87 ms, echo time 1.87 ms and acquisition matrix 256/256. For the Siemens 3T Skyra: slice thickness 8 mm with a 2 mm gap, repetition time 43.29 ms, echo time 1.46 ms and acquisition matrix 256/208. Having data from different vendors and field strengths supports generalization of the model results. The data were collected from patients with MI, whose demographic data are shown in Table 1.

Automated segmentation of myocardial infarction: myocardial infarction-Net (MyI-Net)
As reported by the Association of American Medical Colleges (AAMC), the US will face an urgent shortage of physicians (approximately 122,000 by 2032) while the nation's population is still growing and aging [20]. A similar situation is expected to emerge in the UK: only 2% of radiology departments are able to fulfil their imaging interpretation tasks within work hours, as reported by the Royal College of Radiologists (RCR) in 'Clinical Radiology U.K. Workforce Census Report 2018' [21]. The report also highlighted that only 2% of trusts and health boards in the UK have adequate interventional radiologists to provide for urgent procedures. Therefore, an automatic image interpretation service is urgently needed.
Deep learning has demonstrated great potential in biomedical data analysis thanks to its powerful learning abilities [22][23][24]. For example, Nam, et al. reported in 2018 that their deep-learning algorithm for detecting malignant pulmonary nodules outperformed radiologists in radiograph classification. Our pilot work based on CNNs [26] also illustrated that deep learning can be used for automatic detection of MI. Here we take another step forward to improve the performance of MI detection based on machine learning by proposing a new class of models: MyI-Net. Details of the proposed new class are provided below.
At the core of this new model class is the proposal to exploit a wealth of deep learning architectures whose efficiency has already been demonstrated in image processing applications. We use these models as part of the feature extraction process. Feature extraction is then combined with Atrous Spatial Pyramid Pooling (ASPP). The latter generates multiple receptive fields, enabling us to capture information at different spatial scales in a balanced way. This is followed by an add-on module for spatial information recovery.

Feature extraction by MI-ResNet
Feature extraction is based on deep CNN networks (see e.g. [25,27,28]). Figure 2 shows an example flowchart of the relevant processes in a conventional CNN. As shown in Figure 2, all layers, including convolutional layers, ReLU layers and pooling layers, are cascaded one after another. However, such simple and uniformly cascaded structures have severe technical drawbacks. In particular, deep conventional CNNs may be hard to train in practice due to the well-known problems of exploding or vanishing gradients. To circumvent this issue, we adopt the ResNet model proposed by He, et al. [29] as the basic backbone model for feature extraction. We call this backbone model MI-ResNet. In contrast to conventional CNNs, ResNet provides a structure with short-cut connections that skip one or more weight layers, as shown in Figure 3.
Mathematically, the structure of CNN and ResNet processing blocks can be expressed as:
$$x_l = H_l(x_{l-1}) \quad \text{(conventional CNN)}$$
$$x_l = H_l(x_{l-1}) + x_{l-1} \quad \text{(ResNet)}$$
in which $x_{l-1}$ stands for the output from the previous layer, $H_l(x_{l-1})$ is the output of the $l$-th layer in the conventional CNN's counterpart, and $x_l$ in the second expression is the output of a ResNet block constructed from the original CNN by adding a short-cut connection (residual information).
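The difference between the two block types can be illustrated with a minimal Python sketch. It is not the MI-ResNet implementation: `toy_layer` is a hypothetical stand-in for the weight layers, and feature maps are reduced to flat lists purely for illustration.

```python
from typing import Callable, List

Vector = List[float]


def relu(v: Vector) -> Vector:
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def plain_block(x_prev: Vector, layer: Callable[[Vector], Vector]) -> Vector:
    """Conventional block: the output is just the layer applied to the input."""
    return layer(x_prev)


def residual_block(x_prev: Vector, layer: Callable[[Vector], Vector]) -> Vector:
    """Residual block: the input is added back onto the layer output."""
    h = layer(x_prev)
    if len(h) != len(x_prev):
        raise ValueError("the short-cut connection needs matching shapes")
    return [a + b for a, b in zip(h, x_prev)]


def toy_layer(v: Vector) -> Vector:
    # Hypothetical stand-in for a weight layer: scale by 0.5, then ReLU.
    return relu([0.5 * a for a in v])


x = [1.0, -2.0, 4.0]
print(plain_block(x, toy_layer))     # [0.5, 0.0, 2.0]
print(residual_block(x, toy_layer))  # [1.5, -2.0, 6.0]
```

In the residual variant the input passes through unchanged along the short-cut, so the layer only has to model the residual; this is the property that eases the training of deep networks.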

Feature extraction by MI-MobileNet
ResNet architectures mainly focus on improving the accuracy of the deep network and ignore computation costs. Therefore, we consider MobileNetV2 as another potentially relevant backbone for our proposed MyI-Net. We call such architectures MI-MobileNet feature extractors. MobileNetV2 was proposed by Sandler, et al. [30], a research group at Google. Before the advent of MobileNetV2, MobileNet was introduced by Howard, et al. [31], also at Google, with the idea of depthwise separable convolution (DSC), which can dramatically reduce model size and complexity. DSC consists of two components: depthwise convolution (DC) and pointwise convolution. DC applies a single filter to each input channel, and the pointwise convolution applies 1x1 filters to create a linear combination of the outputs of the DC layer. Batch normalization and ReLU layers follow both the DC layer and the pointwise convolution layer. The structure of DSC is shown in Figure 5. As the DC in MobileNet uses a 3x3 filter, we also use a 3x3 filter in Figure 5 to show the difference between the structures of the standard convolution and DSC. Although MobileNet is already small and computationally efficient, it can be made even faster and smaller for practical applications through the so-called width and resolution multipliers. The width multiplier makes the network uniformly thinner at each layer, and the resolution multiplier is applied to the input image to further reduce the computation cost. In detail, suppose that the width multiplier is $\varepsilon$. Then, for a given layer with $D_I$ input channels, the number of input channels becomes $\varepsilon D_I$. Likewise, if $D_O$ is the number of output channels, then this layer's number of output channels becomes $\varepsilon D_O$. Therefore, the computation cost of one DSC can be reduced to:
$$D_K \cdot D_K \cdot \varepsilon D_I \cdot D_F \cdot D_F + \varepsilon D_I \cdot \varepsilon D_O \cdot D_F \cdot D_F$$
where $D_K$ is the spatial width and height of the square convolution kernel, $D_F$ stands for the spatial width and height of a square input feature map, and $\varepsilon$ is taken in the interval (0, 1]. If we also set the resolution multiplier to $\rho \in (0, 1]$, then the computation cost can be described as:
$$D_K \cdot D_K \cdot \varepsilon D_I \cdot \rho D_F \cdot \rho D_F + \varepsilon D_I \cdot \varepsilon D_O \cdot \rho D_F \cdot \rho D_F$$
As shown in Figure 6, a bottleneck in MobileNetV2 is characterized by a first layer that is a 1x1 convolution followed by ReLU6, a second layer that is the DC layer, and a final layer that is a 1x1 convolution without any non-linear operation. In the residual block, the input of the block is combined with the output of the final 1x1 convolution layer. The whole structure of MobileNetV2 can be found in [30].
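As a rough illustration of the cost expressions for DSC, the following Python sketch counts multiply-add operations for a standard convolution and for a DSC with width and resolution multipliers. The function names and the layer sizes in the example are assumptions chosen for illustration, and the kernel size `d_k` follows the usual MobileNet formulation [31].

```python
def standard_conv_cost(d_k: int, d_i: int, d_o: int, d_f: int) -> int:
    """Multiply-adds of a standard convolution with a d_k x d_k kernel."""
    return d_k * d_k * d_i * d_o * d_f * d_f


def dsc_cost(d_k: int, d_i: int, d_o: int, d_f: int,
             width: float = 1.0, resolution: float = 1.0) -> float:
    """Multiply-adds of one depthwise separable convolution.

    width      -- width multiplier in (0, 1], thins the channels
    resolution -- resolution multiplier in (0, 1], shrinks the feature map
    """
    if not (0.0 < width <= 1.0 and 0.0 < resolution <= 1.0):
        raise ValueError("multipliers must lie in the interval (0, 1]")
    channels_in = width * d_i
    channels_out = width * d_o
    size = resolution * d_f
    depthwise = d_k * d_k * channels_in * size * size
    pointwise = channels_in * channels_out * size * size
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 32 -> 64 channels, 56x56 feature map.
full = standard_conv_cost(3, 32, 64, 56)
separable = dsc_cost(3, 32, 64, 56)
thin = dsc_cost(3, 32, 64, 56, width=0.5, resolution=0.5)
print(separable / full)  # equals 1/64 + 1/9 up to rounding
print(thin / full)
```

With both multipliers equal to 1, the ratio of DSC cost to standard cost reduces to 1/D_O + 1/D_K^2, which is why a 3x3 DSC needs roughly eight to nine times fewer operations than the standard convolution.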

Atrous Spatial Pyramid pooling
In a conventional convolutional neural network, we can obtain more low-level and high-level features as the network goes deeper and wider. However, this standard approach produces a relatively limited number of spatially local convolutional features, which may nevertheless contain crucial information for semantic segmentation. Conventional approaches to semantic segmentation therefore employ different methods to increase spatially relevant information content, such as stacking more layers and up-sampling. Theoretically, the amount of spatially relevant information can be increased through a broader spectrum of convolutional filters used in the network, from small to large. The size of these filters is sometimes referred to as the receptive field. Thus, the overall receptive field size can be increased by stacking more layers. However, not all information in the receptive fields is equally effective or useful. Likewise, up-sampling increases the receptive field but at the same time may negatively affect our capability to extract useful information about local context. In order to keep the important context information and decrease the ambiguity caused by local areas, while keeping the number of parameters in the receptive fields constant, Atrous convolution, also known as dilated convolution, was proposed [32]. Atrous convolution is implemented by assigning zero values to the relevant weights of the filter. Formally, it can be expressed as:
$$h(i) = \sum_{l=1}^{L} x(i + k \cdot l)\, w(l)$$
where $k$ stands for the dilation rate. When $k = 1$, it reverts to a conventional convolution. $w(l)$ represents the filter with size $L$, $x(i)$ is the input and $h(i)$ is the output of the Atrous convolution. Figure 7 shows examples of the Atrous convolution with rate k=1, 2, 3. When k=2, 3, we obtain feature maps with a larger receptive field, as shown in Figure 7 (b) and Figure 7 (c).
To avoid the bias that class imbalance introduces and at the same time to utilize our data fully, we employ an appropriately chosen weight matrix to balance the contribution of data from different-sized classes whilst training the model.
The weight matrix assigns appropriate weights to each training sample when a training algorithm computes and subsequently uses a given loss function. In this work, the highest weight is assigned to data associated with the scar tissue, and the smallest weight is assigned to data samples representing background pixels. Mathematically, the weight matrix we used is defined as follows:
$$f_i = \frac{N_i}{\sum_{j=1}^{n} N_j}$$
$$w_i = \frac{\operatorname{median}(f_1, \ldots, f_n)}{f_i}$$
Here, $N_i$ represents the number of pixels in each class of the dataset, $n$ represents the total number of categories/classes, $f_i$ stands for the frequency, $i$ represents indices, and $w_i$ is the weight of each category/class.
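One common way to obtain class weights of this kind is median-frequency balancing, sketched below in Python. Treating it as the scheme used here is an assumption, and the pixel counts in the example are hypothetical, not taken from the dataset used in this work.

```python
from statistics import median
from typing import Dict


def median_frequency_weights(pixel_counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by median(frequency) / frequency of that class."""
    total = sum(pixel_counts.values())
    if total == 0 or any(n <= 0 for n in pixel_counts.values()):
        raise ValueError("every class needs at least one pixel")
    frequency = {name: n / total for name, n in pixel_counts.items()}
    mid = median(frequency.values())
    return {name: mid / f for name, f in frequency.items()}


# Hypothetical pixel counts for the four classes (illustration only).
counts = {"scar": 1_000, "blood": 10_000, "muscle": 20_000, "background": 969_000}
weights = median_frequency_weights(counts)
for name, weight in weights.items():
    print(f"{name}: {weight:.4f}")
```

Rare classes such as scar receive weights well above 1, while the dominant background class receives a weight close to 0, so each class contributes more evenly to the loss.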

Data augmentation
As the dataset used to train, test, and validate our models was very limited (particularly for scar tissue), data augmentation was used to produce a more diversified dataset. The data augmentation methods used in this work include geometric transformations of the training dataset, such as rotation (by a random angle from 0 to 360 degrees) and random scaling with scaling factors from 0.9 to 1.1, as illustrated in Figure 10.
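The geometric part of this augmentation can be sketched in a few lines of Python. The sketch only samples the parameters and builds the corresponding 2x2 transform; the function names, the fixed seed and the choice of rotating about the image centre are assumptions made for illustration.

```python
import math
import random
from typing import List, Tuple


def sample_parameters(rng: random.Random) -> Tuple[float, float]:
    """Draw a rotation angle in degrees and an isotropic scaling factor."""
    angle = rng.uniform(0.0, 360.0)
    scale = rng.uniform(0.9, 1.1)
    return angle, scale


def rotation_scaling_matrix(angle_deg: float, scale: float) -> List[List[float]]:
    """2x2 matrix that rotates by angle_deg and scales by scale."""
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    return [[scale * c, -scale * s], [scale * s, scale * c]]


def transform_point(point, matrix, centre):
    """Apply the matrix to a pixel coordinate about the given centre."""
    dx, dy = point[0] - centre[0], point[1] - centre[1]
    x = matrix[0][0] * dx + matrix[0][1] * dy + centre[0]
    y = matrix[1][0] * dx + matrix[1][1] * dy + centre[1]
    return x, y


rng = random.Random(0)  # fixed seed so the sketch is repeatable
angle, scale = sample_parameters(rng)
matrix = rotation_scaling_matrix(angle, scale)
print(angle, scale)
print(transform_point((200.0, 128.0), matrix, centre=(128.0, 128.0)))
```

For segmentation, the same sampled transform would be applied to an image and to its label map, with nearest-neighbour interpolation for the labels so that no new class values are created.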

Performance metrics
In order to validate the proposed methods, we employed several performance metrics, including accuracy, bfscore, IoU, and per-image score.
Accuracy: accuracy at the pixel level is defined as the percentage of correctly identified pixels for each category, and is used in most semantic segmentation work. Suppose the confusion matrix P, which aggregates the prediction results for the whole dataset, is:
$$P_{ab} = \sum_{I} \left| \left\{ z \in I : S_{gt}^{I}(z) = a \ \text{and}\ S_{ps}^{I}(z) = b \right\} \right|$$
where $z$ stands for each pixel in the image $I$, $S_{gt}^{I}(z)$ stands for the ground truth and $S_{ps}^{I}(z)$ is the prediction result for $z$. $P_{ab}$ is the total number of pixels with label $a$ and prediction output $b$. If we have $n$ categories, then $G_a = \sum_{b=1}^{n} P_{ab}$ is the total number of pixels with label $a$, and $Q_b = \sum_{a=1}^{n} P_{ab}$ is the number of pixels predicted as $b$. Then the global accuracy can be expressed as:
$$A_{global} = \frac{\sum_{a=1}^{n} P_{aa}}{\sum_{a=1}^{n} G_a}$$
A category accuracy is the fraction of correctly detected pixels in that category. The global accuracy $A_{global}$ is the fraction of all correctly detected pixels regardless of category, which provides a quick and inexpensive measure of the segmentation algorithm. The mean accuracy is the average of the category accuracies:
$$A_{mean} = \frac{1}{n} \sum_{a=1}^{n} \frac{P_{aa}}{G_a}$$
Bfscore: the bfscore indicates how well the predicted boundary aligns with the ground truth boundary. As contour quality contributes significantly to the segmentation result, we use the bfscore as one measure in this research. It is expressed as the harmonic mean of the recall $Rc^{o}$ and precision $Pr^{o}$, and determines whether the predicted boundary matches the ground truth boundary within a distance error tolerance $\partial$. In detail, let $B_{gt}^{o}$ be the boundary of the binary ground truth segmentation map for a specific class $o$, with $S_{gt}^{o}(z) = [\![ S_{gt}(z) == o ]\!]$, where $[\![ \cdot ]\!]$ is the Iverson bracket notation (2):
$$[\![ s ]\!] = \begin{cases} 1 & \text{if } s \text{ is true} \\ 0 & \text{otherwise} \end{cases}$$
Let $B_{ps}^{o}$ be the predicted binary contour map for the segmentation result $S_{ps}^{o}$. Then, with a distance error tolerance $\partial$, precision and recall for each class are defined as
$$Pr^{o} = \frac{1}{|B_{ps}^{o}|} \sum_{z \in B_{ps}^{o}} [\![ d(z, B_{gt}^{o}) < \partial ]\!]$$
$$Rc^{o} = \frac{1}{|B_{gt}^{o}|} \sum_{z \in B_{gt}^{o}} [\![ d(z, B_{ps}^{o}) < \partial ]\!]$$
in which $d(\cdot)$ stands for the Euclidean distance and $\partial$ is usually set to 0.75% of the image diagonal.
Then, for the category $o$, we get:
$$F_1^{o} = \frac{2 \cdot Pr^{o} \cdot Rc^{o}}{Rc^{o} + Pr^{o}}$$
To generate the bfscore for each image, we average $F_1^{o}$ over all classes. Similarly, we average the bfscore of each image over the whole dataset to obtain the dataset's bfscore.
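The boundary-matching idea behind the bfscore can be sketched for a single class as follows. This is a simplified illustration rather than the routine used in our experiments: boundaries are given directly as lists of pixel coordinates, distances are computed by brute force, and the function names are assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def tolerance(height: int, width: int, fraction: float = 0.0075) -> float:
    """Distance tolerance as a fraction (0.75% by default) of the image diagonal."""
    return fraction * math.hypot(height, width)


def hit_fraction(points: List[Point], reference: List[Point], tol: float) -> float:
    """Fraction of `points` lying closer than `tol` to some point of `reference`."""
    if not points or not reference:
        return 0.0
    hits = sum(
        1 for p in points
        if min(math.dist(p, q) for q in reference) < tol
    )
    return hits / len(points)


def boundary_f1(predicted: List[Point], truth: List[Point], tol: float) -> float:
    """Harmonic mean of boundary precision and recall for one class."""
    precision = hit_fraction(predicted, truth, tol)
    recall = hit_fraction(truth, predicted, tol)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


truth = [(0, 0), (0, 1), (0, 2), (0, 3)]
predicted = [(1, 0), (1, 1), (5, 5)]
print(boundary_f1(predicted, truth, tol=1.5))  # precision 2/3, recall 3/4
print(tolerance(256, 256))
```

Averaging `boundary_f1` over the classes of an image gives the per-image score, and averaging those over all images gives the dataset-level value.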
Intersection over union (IoU), also known as the Jaccard similarity coefficient, can be used to provide a statistical accuracy measure that better reveals false positives. For each category, IoU is calculated as the ratio of correctly classified pixels to the number of pixels that are either labelled or predicted as that category (their union). Weighted IoU (wIoU) is mainly used to measure the performance of a model tested on disproportionally sized classes, and aims to reduce the impact of errors in the small classes on the aggregate quality score.
As our dataset is severely imbalanced across categories, the mean IoU may not be an appropriate measure. Therefore, we used the wIoU instead of the mean IoU to measure the performance of the proposed algorithm.
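The confusion-matrix metrics described in this section (global accuracy, mean accuracy, IoU and wIoU) can be computed as in the following sketch. The indexing convention (`conf[a][b]` counts pixels with true label a predicted as b), the weighting of wIoU by the number of ground truth pixels per class, and the toy matrix are assumptions for illustration.

```python
from typing import Dict, List


def segmentation_metrics(conf: List[List[int]]) -> Dict[str, float]:
    """Compute accuracy and IoU metrics from a square confusion matrix.

    conf[a][b] is the number of pixels with true label a predicted as b.
    Every class is assumed to occur at least once in the ground truth.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    label_totals = [sum(conf[a]) for a in range(n)]
    predicted_totals = [sum(conf[a][b] for a in range(n)) for b in range(n)]
    correct = [conf[a][a] for a in range(n)]

    class_accuracy = [correct[a] / label_totals[a] for a in range(n)]
    iou = [
        correct[a] / (label_totals[a] + predicted_totals[a] - correct[a])
        for a in range(n)
    ]
    return {
        "global_accuracy": sum(correct) / total,
        "mean_accuracy": sum(class_accuracy) / n,
        "mean_iou": sum(iou) / n,
        # Each class IoU is weighted by its share of ground truth pixels.
        "weighted_iou": sum(label_totals[a] * iou[a] for a in range(n)) / total,
    }


# Toy two-class example: a large class and a small class.
toy = [[90, 10],
       [5, 5]]
for name, value in segmentation_metrics(toy).items():
    print(f"{name}: {value:.4f}")
```

In the toy example the small class has an IoU of only 0.25, which pulls the mean IoU down to about 0.55, while the weighted IoU stays near 0.80 because the large class dominates the weighting.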
Per-image score: first, as we need to avoid an algorithm that works extremely well on some images but poorly on most, it is necessary to check the performance of the model not only over individual pixels in our test set but also for each individual image. Second, the per-image score helps to reduce the bias towards large objects, because missed or incorrectly segmented small objects have only a small impact on the confusion matrix. Third, the per-image score enables drawing realistic comparisons of
Max epochs: 50
Mini batch size: 10
Execute environment: GPU
Validation patience: 4

Segmentation result based on the proposed method
In order to explore the flexibility of our method and optimise performance, we built models based on different feature extraction methods and, for ease of reference, named the three corresponding models MI-ResNet50-AC, MI-ResNet18-AC and MI-MobileNet-AC. Due to the data imbalance, the class weights were set to 13.7678, 0.7802, 1.3923 and 0.0163 for the scar, muscle, blood and background classes, respectively, as shown in Table 3. As seen in Table 4, for the global segmentation, MI-ResNet50-AC provides the best performance, with a global accuracy of 0.9738, mean accuracy of 0.8601, wIoU of 0.9647 and bfscore of 0.6446. MI-ResNet18-AC is slightly better than MI-MobileNet-AC. However, Table 5 shows that, for the scar tissue, MI-ResNet18-AC provides the best performance in terms of accuracy and similar performance in terms of bfscore compared with MI-ResNet50-AC. Figure 12 shows a bar chart for a clearer comparison of the performance of all three proposed models. Table 6 shows the confusion matrix for each model for a clear performance comparison. The rows stand for the predicted class and the columns stand for the true class. Correctly classified observations are shown in the diagonal cells, and incorrectly classified observations are shown in the off-diagonal cells.
The time analysis is based on the current training dataset. The three proposed models, MI-MobileNet-AC, MI-ResNet50-AC and MI-ResNet18-AC, took 24'1'', 57'35'' and 24'50'', respectively. Table 7 shows that the computational cost of MI-ResNet50-AC is more than double that of the other two models, while MI-ResNet18-AC and MI-MobileNet-AC have similar computation costs. With the validation patience set to 4, MI-MobileNet-AC, MI-ResNet50-AC and MI-ResNet18-AC stopped at epochs 7, 2 and 10, respectively (all earlier than the maximal number of epochs we set for these experiments).
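The effect of the validation patience on when training stops can be illustrated with a generic early-stopping sketch. It assumes one validation check per step and counts consecutive checks without improvement; the exact rule applied by the training framework may differ, and the loss values are hypothetical.

```python
from typing import Sequence


def stopping_check(val_losses: Sequence[float],
                   max_checks: int = 50,
                   patience: int = 4) -> int:
    """Return the 1-based validation check at which training stops.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive checks, or when
    `max_checks` (or the end of the sequence) is reached.
    """
    best = float("inf")
    stale = 0
    last = 0
    for check, loss in enumerate(val_losses[:max_checks], start=1):
        last = check
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return check
    return last


# Hypothetical validation losses: the best value is reached at check 3.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.7, 0.9]
print(stopping_check(losses))  # stops at check 7
```

A small patience keeps training short, at the risk of stopping before a slower model has converged.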

Segmentation result based on state-of-the-art methods
In order to demonstrate the advantage of the proposed approach and models, we compared the performance of our models to that of state-of-the-art models, including the conventional CNN and UNet (3). A summary of the performance of these models in the task of global segmentation is shown in Table 8. As we can see from Table 8, our proposed model MI-ResNet50-AC provides the highest accuracy and bfscore (the UNet architecture trained on the same data provides a global accuracy of 0.6332, mean accuracy of 0.6222, wIoU of 0.6117, and bfscore of 0.1626). Remarkably, our network's bfscore (0.6446) is approximately four-fold higher than that of the state-of-the-art on our data.
To show more clearly how the performance of our proposed method relates to that of the most recent method, UNet, we provide scatter plots based on per-image comparison in Figure 13. Figure 13 (a), Figure 13 (b) and Figure 13 (c) show the per-image global accuracy, average accuracy and bfscore, respectively, of MI-ResNet18-AC and UNet versus MI-ResNet50-AC. From the perspective of global segmentation of each image, MI-ResNet50-AC surpasses MI-ResNet18-AC in some cases, while in a few cases MI-ResNet18-AC performs better than MI-ResNet50-AC. However, both MI-ResNet50-AC and MI-ResNet18-AC outperformed UNet for most images in this dataset in terms of global accuracy, average accuracy and bfscore. As can be seen from the confusion matrix (Table 6) and Table 5, MI-ResNet18-AC has the best performance for scar quantification compared to MI-ResNet50-AC and UNet. Figure 14 shows the per-image scatter plot of the scar elements correctly detected by MI-ResNet18-AC, MI-ResNet50-AC and UNet versus the ground truth; MI-ResNet18-AC clearly has more cases close to the diagonal line. Figure 15 shows the scar detection by MI-ResNet18-AC, including false alarms, versus the ground truth. Although MI-ResNet18-AC detects true positives at a satisfactory rate, the false alarm rate is still quite high for clinical application. Therefore, reducing the false alarm rate is a priority for our future research. For clinical reference, we also show the scatter plot per case (Figure 16). Error correction approaches may potentially be used to address the false alarm issue (4-6). Detailed exploration of such functionality, however, is outside the scope of the current work.

Conclusion
In this paper, we proposed a new end-to-end method for automatic MI segmentation, as the detection and quantification of MI is crucial for determining clinical management and prognosis. Although LGE-CMR (7) is the non-invasive "gold standard" method, as it permits optimal differentiation between normal and damaged myocardium through the use of gadolinium-based contrast agents and special magnetic resonance pulse sequences, to date there is no fully automatic "gold standard" method for the detection and quantification of MI. In this work we take a step towards this aim, with the hope of reducing the uncertainty brought by technical variability and the inherent bias of the data and labels. We propose a novel deep learning model, MyI-Net, which accommodates MI-ResNet and MI-MobileNet models as initial feature extractors and is equipped with ASPP and an add-on module for the recovery of spatial information to compute the final segmentation output. Considering the limited size of the dataset, a data augmentation pre-processing step was integrated into our model construction pipeline. It is apparent from Figure 9 that our dataset was severely imbalanced, as it contained primarily background elements, followed, in descending order, by blood pool, muscle and scar. A weight matrix was used to minimize our classifier's bias towards any specific category. The performance of the algorithm is shown in Table 4, with the best performance for global segmentation being provided by MI-ResNet50-AC, with a global accuracy of 0.9738, mean accuracy of 0.8601, wIoU of 0.9647, and bfscore of 0.6446. In comparison with other state-of-the-art methods (Table 8), our model outperformed state-of-the-art architectures on the dataset we had access to. However, for the detection of scar tissue, we found that MI-ResNet18-AC, a smaller model than MI-ResNet50-AC, provides the highest scar detection accuracy and a bfscore similar to that of MI-ResNet50-AC. Furthermore, we compared the computation cost of the three proposed models. Table 7 shows that MI-ResNet50-AC required the largest amount of time compared to the other two models, mainly because MI-ResNet50-AC is a deeper network. Considering the above, we integrated both MI-ResNet50-AC and MI-ResNet18-AC into our proposed system MyI-Net. In general, however, the choice of a specific feature extractor model depends on the given task (e.g. MI scar detection or global segmentation). One limitation of the work is the relatively small size of the dataset. To build more accurate, higher-performing models, larger datasets may be required. These datasets are also necessary for testing the system at the level of individual patients. To improve the generalization capabilities of the model, further data should be collected from a greater variety of scanner vendors, and should also include more data with microvascular obstruction (MVO). Meanwhile, based on the per-image analysis, we observe that the proposed model is not as stable as might be desired. Therefore, the algorithm needs to be tuned further to achieve more robust and stable performance. We are considering a fusion technique in the future to fully explore the potential of the proposed deep learning models. We will also consider error correction approaches (5,6,8) in our future research to minimize false positives and further improve the performance of our proposed method.
The process, applied to MRI MI segmentation, is illustrated by the diagram shown in Figure 1. In Figure 1, X = (x_1, x_2, …, x_n) stands for the low-level features that can be extracted from a specific i-th layer of the base feature extraction network. The core outputs of the backbone deep learning model used in the initial processing pipeline are referred to as the high-level features.
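The fusion of low-level and high-level features can be sketched in plain Python as channel-wise concatenation after up-sampling. This is only an illustration: nearest-neighbour up-sampling, the nested-list representation of feature maps and the function names are assumptions, not a description of the layers used in MyI-Net.

```python
from typing import List

Channel = List[List[float]]  # one 2D feature map


def upsample_nearest(channel: Channel, factor: int) -> Channel:
    """Repeat every value `factor` times along both spatial axes."""
    return [
        [row[j // factor] for j in range(len(row) * factor)]
        for row in channel
        for _ in range(factor)
    ]


def fuse_features(low_level: List[Channel],
                  high_level: List[Channel],
                  factor: int) -> List[Channel]:
    """Up-sample the high-level channels and concatenate along the channel axis."""
    upsampled = [upsample_nearest(ch, factor) for ch in high_level]
    height, width = len(low_level[0]), len(low_level[0][0])
    for ch in upsampled:
        if len(ch) != height or len(ch[0]) != width:
            raise ValueError("spatial sizes must match before concatenation")
    return low_level + upsampled


low = [[[0.0] * 4 for _ in range(4)]]                        # 1 channel, 4x4
high = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]  # 2 channels, 2x2
fused = fuse_features(low, high, factor=2)
print(len(fused), len(fused[0]), len(fused[0][0]))  # 3 4 4
```

The fused tensor keeps the finer spatial grid of the low-level features while carrying the context encoded in the high-level channels.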

Figure 1. Flowchart of the proposed model.

Figure 3. Short-cut structure of the ResNet block.

Figure 5. The structures of standard convolution and DSC.
The MobileNetV2 architecture is a descendant of the base MobileNet in which further processing operations, namely bottlenecks, are added. For the bottlenecks, there are two types of blocks, the residual block and the down-sizing block, as shown in Figure 6.

Figure 6. Residual block and down-sizing block in MobileNetV2.

Figure 7. Atrous convolution (the red dot means non-zero).
In this paper, we propose using the Atrous Spatial Pyramid Pooling (ASPP) method as an extra module cascaded to the feature extraction network, as shown in Figure 1. This extra module enables us to adjust and maintain a constant size (weights-wise) of receptive fields across scales in the network. ASPP feature maps are generated via a 1x1 convolution and three Atrous convolutions with rates k, 2k and 3k. In the models we generated in this work, the value of k was set to 6. Their outputs are then fused together to form new feature maps.
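The effect of the dilation rate can be illustrated with a one-dimensional Python sketch of Atrous convolution, together with the receptive fields obtained for the rates k, 2k and 3k with k = 6. The zero padding at the signal border, the filter index starting at 0 and the 3-tap filter are simplifying assumptions for illustration.

```python
from typing import List, Sequence


def atrous_conv1d(x: Sequence[float], w: Sequence[float], rate: int = 1) -> List[float]:
    """h(i) = sum over l of x(i + rate * l) * w(l), with zeros outside x."""
    n = len(x)
    out = []
    for i in range(n):
        acc = 0.0
        for l, weight in enumerate(w):
            j = i + rate * l
            if 0 <= j < n:  # positions outside the signal contribute zero
                acc += x[j] * weight
        out.append(acc)
    return out


def receptive_field(filter_size: int, rate: int) -> int:
    """Span of input samples covered by one dilated filter."""
    return rate * (filter_size - 1) + 1


signal = [1, 2, 3, 4, 5, 6]
taps = [1, 1, 1]
print(atrous_conv1d(signal, taps, rate=1))  # [6.0, 9.0, 12.0, 15.0, 11.0, 6.0]
print(atrous_conv1d(signal, taps, rate=2))  # [9.0, 12.0, 8.0, 10.0, 5.0, 6.0]

k = 6
for rate in (k, 2 * k, 3 * k):
    print(rate, receptive_field(3, rate))  # 13, 25 and 37 samples
```

The number of filter weights stays at three in every case; only the spacing between the sampled inputs grows with the rate, which is how ASPP gathers context at several scales without adding parameters per branch.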

Figure 8. Structure of Atrous spatial pooling with feature extraction network.

Figure 9 shows class frequencies of data in our dataset. As we can see from this figure, the dataset is severely imbalanced. Imbalanced datasets, if processed without due care, could produce models that are biased towards the most common category.

Figure 10. Examples of data augmentation.

Figure 12. Bar chart of the performance of three proposed models.

Figure 13. Scatter plot of the per-image performance of MI-ResNet18-AC and UNet vs MI-ResNet50-AC.

Figure 15. Scar detection based on MI-ResNet18-AC vs the ground truth, including false alarms.

Figure 16. Detection result per case. Series 1 includes the false alarms and series 2 contains only the true positives.

Table 1. Demographic data

Table 3. Class weight

Table 4. Performance achieved by the proposed models

Table 5. Performance for each category based on proposed models

Table 6. Confusion matrix based on proposed models

Table 7. Time analysis of the proposed algorithm

Table 8. Comparison to the state-of-the-art methods