In our experiments, we first evaluated the performance of our “multi-modality” CNN model implemented to classify seven different modalities of medical images. Next, we analyzed its behavior in modality classification by using the proposed CRM to locate and highlight a discriminative ROI within the images of each modality class. We also demonstrated the effectiveness of our CRM by comparing its localization performance with that of other existing visualization methods.
4.2. Localization Evaluation
Now, we apply the proposed CRM to identify a discriminative ROI of images in each modality class for characterizing the behavior of our CNN model.
Figure 4 shows several examples of heatmap images reflecting the CRM score for a given input image. These heatmap images were generated by normalizing the CRM score to the range [0, 255] and displaying only the values above 20% of the maximum CRM score.
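As a concrete illustration of this post-processing step, the following is a minimal sketch assuming the raw CRM score is available as a 2-D NumPy array; the function name, the colormap, and the upsampling step are our assumptions, not the paper's released code:

```python
import numpy as np
import cv2


def crm_to_heatmap(crm_score, image_shape, threshold=0.2):
    """Normalize a raw CRM score map, zero out values below 20% of the
    maximum, map it to [0, 255], and upsample it to the input-image size."""
    score = crm_score.astype(np.float32)
    score -= score.min()
    if score.max() > 0:
        score /= score.max()                      # now in [0, 1]
    score[score < threshold] = 0.0                # keep only values above 20% of the max
    score = np.uint8(255 * score)                 # map to [0, 255]
    score = cv2.resize(score, image_shape[::-1])  # cv2 expects (width, height)
    return cv2.applyColorMap(score, cv2.COLORMAP_JET)
```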
We can first see that our CRM consistently localized and highlighted a specific area within the images as the discriminative ROI for the classes of (a) abdomen CT, (b) brain MRI, (c) cardiac abdomen ultrasound, (d) chest X-ray, and (f) retinal fundoscopy, where the images have a high level of “intra-class” similarity of shapes or patterns. In particular, the heatmap examples for the retinal fundus images show that the optic disc area was successfully identified as the most important ROI (red-colored) by the proposed CRM even though its location varied considerably (e.g., opposite sides) among images. Accordingly, the CRM analysis revealed that our “multi-modality” CNN model mainly focused on a region of the input image that is common in shape or pattern to other images within the same class but distinct from those in other classes, treating it as important and discriminative for correct modality classification.
On the other hand, we can also observe from the heatmap examples shown in Figure 4e that our CNN model behaved as an object detector for images from the fluorescence microscopy class, in which the images have a very low level of “intra-class” similarity (randomly shaped objects). The majority of images in the statistical graph class were found to have X and Y axes (except the Venn diagram images) with a consistent statistical graph pattern, even though the graphs themselves looked quite different among images. In such cases, our CRM highlighted both the X and Y axes and the statistical data graph, as shown in Figure 4f.
Next, we evaluated the localization performance of the proposed CRM by comparing it with two existing localization methods: CAM and Grad-CAM. As mentioned in the previous section, both methods are based on a weighted sum of the feature maps from the last convolutional layer, and Grad-CAM additionally applies the ReLU function so that only features with positive weights are considered.
Figure 5 shows several examples of the heatmaps generated using each localization method. It can first be seen that the heatmaps resulting from the CAM and Grad-CAM processes were virtually the same. Table 3 also quantitatively supports this observation by showing that the average number and ratio of highlighted pixels per image in each modality class were almost identical for CAM and Grad-CAM. Our analysis revealed that the effect of the ReLU function in Grad-CAM was almost completely offset by using only the mapping score values above 20% of the maximum score when generating the heatmaps.
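To make this comparison concrete, the following is a minimal NumPy sketch of the weighted-sum formulation shared by the two methods; the function name and array shapes are our assumptions. For a GAP-plus-dense classifier, Grad-CAM's gradient-derived weights coincide with CAM's dense-layer weights up to a constant factor, and the 20% threshold then removes the small and negative values that ReLU would otherwise have clipped:

```python
import numpy as np


def class_activation_map(feature_maps, class_weights, use_relu=False):
    """Weighted sum of last-conv feature maps (H, W, K) with per-class
    weights (K,). Setting use_relu=True mimics Grad-CAM's clipping of
    negative contributions."""
    cam = np.tensordot(feature_maps, class_weights, axes=([2], [0]))  # (H, W)
    if use_relu:
        cam = np.maximum(cam, 0.0)
    # The 20%-of-max display threshold discards small and negative values
    # alike, which is why the two variants produce nearly identical heatmaps.
    cam_max = cam.max() if cam.max() > 0 else 1.0
    return np.where(cam >= 0.2 * cam_max, cam, 0.0)
```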
Figure 5 also demonstrates that our CRM had a significant noise-removal effect on the heatmaps, as it consistently highlighted a much smaller ROI than CAM or Grad-CAM. In contrast, both CAM and Grad-CAM highlighted much larger regions as class activation ROIs, and such ROIs often included parts of the background, which carried virtually no useful information.
Table 3 also shows an average reduction of over 30% in ROI size by the proposed CRM. Note that the higher-temperature (red- and yellow-colored) areas on a heatmap represent the most discriminative and important ROI in the corresponding image for modality classification. The examples in Figure 5e,f show that the CRM successfully localized and highlighted the object surrounded by a white dotted circle and the optic disc area in the fundus image, respectively, as the most discriminative ROI, while CAM and Grad-CAM failed to do so.
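The ROI-size statistics in Table 3 can be reproduced from the thresholded maps. Below is a minimal sketch; the function and variable names are ours, and we assume the score maps have already been upsampled to the input-image resolution:

```python
import numpy as np


def highlighted_pixel_ratio(score_map, threshold=0.2):
    """Fraction of pixels whose max-normalized mapping score exceeds the
    display threshold, i.e., the relative size of the highlighted ROI."""
    max_score = score_map.max()
    if max_score <= 0:
        return 0.0
    mask = (score_map / max_score) > threshold
    return mask.sum() / mask.size


# Hypothetical usage: average ROI-size reduction of CRM relative to CAM over
# the images of one class (crm_maps and cam_maps are lists of score arrays).
# reduction = 1.0 - np.mean([highlighted_pixel_ratio(m) for m in crm_maps]) / \
#                   np.mean([highlighted_pixel_ratio(m) for m in cam_maps])
```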
We verified the consistency and localization performance of our method, as shown in Figure 6, by comparing the normal distribution curves derived from the three mapping scores (normalized to the range [0, 1]) for all images in each modality class. As expected, CAM and Grad-CAM generated very similar distributions of the mapping scores for all modality classes. The distribution of the CRM score had substantially smaller mean and standard deviation (STD) values than those of the CAM and Grad-CAM scores for the classes of (a) abdomen CT, (b) brain MRI, (c) cardiac abdomen ultrasound, (d) chest X-ray, and (f) retinal fundoscopy, implying that the ROI within an image defined by our CRM was consistently smaller and more relevant than those of the other methods. The CRM distributions for the fluorescence microscopy and statistical graph classes, shown in Figure 6e,g, where most images contained objects or patterns of random size and shape, had a smaller mean value but an STD similar to those of the CAM and Grad-CAM distributions. These distribution results underscore all of the visual and quantitative findings from Figure 4, Figure 5, and Table 3.
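One plausible way to obtain the per-class mean and STD underlying the normal curves in Figure 6 is sketched below. It assumes the scores of each image are min-max normalized to [0, 1] and then pooled over all pixels of all images in the class; whether the original analysis pools all pixels or only the highlighted ones is not stated, so this is our assumption:

```python
import numpy as np


def class_score_statistics(score_maps):
    """Pool the [0, 1]-normalized mapping scores of all images in a class
    and return the mean and standard deviation used to draw that class's
    normal distribution curve."""
    pooled = []
    for m in score_maps:
        m = m.astype(np.float32)
        m -= m.min()
        if m.max() > 0:
            m /= m.max()
        pooled.append(m.ravel())
    pooled = np.concatenate(pooled)
    return pooled.mean(), pooled.std()
```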
For further verification, we applied CAM and our CRM to two other types of pre-trained DL models, Inception-v3 [31] and Xception [32], and compared their localization performances. Both DL models have a more complicated structure and deeper convolutional layers than VGG16, and they have also shown high performance in many transfer learning-based image classification tasks. We first removed several convolutional layers from each of these DL models to obtain feature maps with a higher spatial resolution at the last convolutional layer before the GAP layer, thereby improving their localization ability, as mentioned in [19]. Specifically, we removed the layers after “mixed6” and “add_11” from Inception-v3 and Xception, respectively, resulting in 17 × 17 feature maps for each DL model. We then fine-tuned these DL models using our training set after adding a convolutional layer, followed by a GAP and a dense layer.
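A minimal Keras sketch of this truncation for Inception-v3 is shown below; the filter count and kernel size of the added convolutional layer, the optimizer, and the exact layer name across Keras versions are our assumptions:

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import InceptionV3

# Load Inception-v3 without its top and cut it at "mixed6" so that the last
# feature maps before the GAP layer are 17 x 17 (layer name as reported above;
# it may differ slightly between Keras versions).
base = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))
x = base.get_layer("mixed6").output

# Added convolutional layer followed by GAP and a dense softmax over the
# seven modality classes (filter count and kernel size are our assumptions).
x = layers.Conv2D(1024, (3, 3), padding="same", activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(7, activation="softmax")(x)

model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

The same pattern applies to Xception, cutting at the “add_11” layer instead.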
Figure 7 and Figure 8 show examples of CAM and CRM heatmaps generated from the Inception-v3- and Xception-based DL models. We observed that the proposed CRM consistently performed better than CAM in noise removal (especially in reducing the highlighted background) and in discriminative object detection and localization, similar to our findings in Figure 5. Thus, these results further validated our method of generating a visual illustration of where the DL model focuses in a given image for the correct classification of its modality.
Lastly, we also generated a “class-level” ROI using the proposed “average_CRM”. This ROI represents the image region receiving the greatest attention from the CNN for the correct prediction of images belonging to a particular class.
Figure 9 presents the corresponding “average_CRM” heatmap for each modality class. Here, the “class-level” ROI was defined as the area with an “average_CRM” score above 70% of the maximum score and was visualized by enclosing it with a bounding box. It can be clearly seen that the heatmap of each modality class had a “class-level” ROI of different size, shape, and location. Such differences in the “class-level” ROI between modality classes serve as a visual explanation for our “multi-modality” CNN model’s high performance (over 98%) in classifying image modalities. In the case of the fluorescence microscopy class, the corresponding “class-level” ROI was a large rectangle covering almost the entire heatmap image, as shown in Figure 9e, which was very likely due to the very low degree of “intra-class” similarity between images in this class.
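A minimal sketch of extracting such a “class-level” ROI from the averaged map is given below; the averaging over per-image CRM maps and the bounding-box extraction follow our reading of the description, and the function and variable names are ours:

```python
import numpy as np


def class_level_roi(crm_maps, threshold=0.7):
    """Average the CRM maps of the images in one class and return the
    bounding box (row_min, row_max, col_min, col_max) of the region whose
    "average_CRM" score exceeds 70% of the maximum of the averaged map."""
    avg = np.mean(np.stack(crm_maps, axis=0), axis=0)
    mask = avg >= threshold * avg.max()
    rows, cols = np.where(mask)
    return rows.min(), rows.max(), cols.min(), cols.max()
```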
In future research, we plan to employ and evaluate other state-of-the-art DL architectures toward self-discovering hierarchical feature representations from medical imaging modalities. The optimal DL model will be used to localize and visualize the discriminative ROIs in medical modality images more reliably and effectively. We will also develop a generalized version of our CRM to visualize the internal representations not only of the last convolutional layer but also of any intermediate convolutional layer, for a further and deeper understanding and interpretation of a DL model.