3.1. Image Modality Classification
Regarding the classification of dermatological image modalities, Table 2 shows the test results for the models, which, as previously mentioned (Section 2.3), were trained only with images from Task A.
As Table 2 shows, regardless of the model used, the F1-score indicates that the modality that most negatively influenced the results was full-body. This may be due to the smaller variability of images in this category: full-body was the least represented class in the original dataset, with only 180 examples available for training (Table 1). Although oversampling was applied to balance the classes during the training phase, this technique introduced only limited feature diversity.
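The class balancing mentioned above can be sketched as inverse-frequency oversampling. The class names and counts below are illustrative (only the 180 full-body training examples match the paper), and this sampler stands in for whatever balancing pipeline the authors actually used:

```python
import random
from collections import Counter

# Hypothetical label list mirroring the imbalance described above:
# full-body has far fewer training examples than the other modalities.
labels = ["macroscopic"] * 1000 + ["anatomic"] * 600 + ["full-body"] * 180

# Inverse-frequency weights: each class is drawn equally often in expectation.
counts = Counter(labels)
weights = [1.0 / counts[y] for y in labels]

random.seed(0)
resampled = random.choices(labels, weights=weights, k=len(labels))

# The resampled set is roughly class-balanced, but minority images are
# simply repeated, so little new feature diversity is introduced.
print(Counter(resampled))
```

The repetition in the last step is precisely the limitation noted above: balancing the class frequencies does not add new full-body appearance patterns.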
Comparing the two models, VGG-16 surpassed MobileNetV2 in almost all metrics and classes, as highlighted in Table 2. These outcomes may result from VGG-16's higher capacity, which allows it to better identify features intrinsic to each modality.
To further understand these results, the confusion matrices of the two models were plotted; they can be found in Figure 3.
Looking at these matrices, it can be seen that, in general, both VGG-16 and MobileNetV2 correctly predicted the modalities of the different dermatological images, as indicated by the darker shades on the matrices' diagonals. Nevertheless, some anatomic, full-body, and macroscopic images were confused by both models. It is worth noting that images belonging to these classes are sometimes very similar and difficult to differentiate effectively. Hence, as the images were labeled manually by several people, different labels may have been assigned to near-identical images, which may have influenced the results.
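For readers less familiar with the structure of these matrices, the computation can be sketched as follows; only the three modalities named above are used, the full label set may contain more classes, and the predictions are made up for illustration:

```python
# Rows are true modalities, columns are predicted ones; correct
# predictions accumulate on the diagonal (the dark cells in Figure 3).
CLASSES = ["anatomic", "full-body", "macroscopic"]
IDX = {c: i for i, c in enumerate(CLASSES)}

def confusion_matrix(y_true, y_pred):
    m = [[0] * len(CLASSES) for _ in CLASSES]
    for t, p in zip(y_true, y_pred):
        m[IDX[t]][IDX[p]] += 1
    return m

# Illustrative case: one anatomic image confused as full-body.
y_true = ["anatomic", "anatomic", "full-body", "macroscopic"]
y_pred = ["anatomic", "full-body", "full-body", "macroscopic"]
print(confusion_matrix(y_true, y_pred))  # [[1, 1, 0], [0, 1, 0], [0, 0, 1]]
```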
To allow a later comparison with the results obtained after incremental training on Task B images, these models were also evaluated on the test images of the incremental task (Task B) and on the global test set containing the test images of both tasks (Tasks A and B). The corresponding accuracy results can be found in Table 3. Regarding the adopted terminology, proposed in [35], each entry denotes the test accuracy on one task after training on another; for instance, one entry corresponds to the test accuracy on Task B after the models have been trained with images from Task A.
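For reference, the transfer metrics used below follow the definitions in [35]; the notation $R_{i,j}$ (the test accuracy on task $j$ after training sequentially up to task $i$) is assumed here for compactness:

$$\mathrm{BWT} = \frac{1}{T-1}\sum_{i=1}^{T-1}\left(R_{T,i}-R_{i,i}\right), \qquad \mathrm{FWT} = \frac{1}{T-1}\sum_{i=2}^{T}\left(R_{i-1,i}-\bar{b}_{i}\right),$$

where $T$ is the number of tasks and $\bar{b}_{i}$ is the test accuracy on task $i$ of a model with random initialization. With the two tasks considered here ($T = 2$), BWT reduces to $R_{2,1}-R_{1,1}$, so a negative BWT directly quantifies catastrophic forgetting, and FWT reduces to $R_{1,2}-\bar{b}_{2}$.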
3.2. Incremental Learning of Image Modalities
To continue training the models with images from Task B, different settings were employed in the implementation of the considered strategies, namely the regularization-strength value (in the case of the EWC), the memory sizes (in the case of the rehearsal strategies), and various numbers of epochs, as detailed in Section 2.4.1 and Section 2.4.2, respectively.
, respectively. Regarding the number of epochs, with respect to the VGG-16 model, in the case of the naive, the EWC, and the A-GEM strategies considering a memory size of 100 and 150, it was verified that when a higher number of epochs was considered, the global performance of the model decreased and catastrophic forgetting increased. Therefore, it is preferable to consider a lower number of epochs when implementing these strategies, such as 10 epochs. In the case of the A-GEM strategy with a memory buffer of 50 examples and of the experience replay strategy considering 250 and 500 images from the first task, it was advantageous to use an intermediate number of epochs, since when the model was trained for 20 epochs, it was possible to further reduce the forgetting, compared with the results achieved for 10 and for 30 epochs. Finally, only the experience replay strategy with a memory size of 100 demonstrated an improvement on the global performance when the model was trained for a larger number of epochs (30 epochs). Concerning the MobileNetV2 model, it was verified that all strategies benefited from being trained for a lower number of epochs, i.e., for 10 epochs. Therefore, in Table 4
it is possible to find the results achieved by each strategy with respect to the number of epochs that provided the best outcomes and concerning both models. The presented results refer to the global test accuracy (considering the test images belonging to both Task A and B together, represented by
) after the two tasks have been trained sequentially, and the catastrophic forgetting (assessed by the BWT metric).
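As a concrete illustration of how the regularization strategy discourages forgetting, below is a toy sketch of the EWC penalty. Scalars stand in for the network weights, the Fisher values are made up, and `ewc_loss` is a hypothetical helper, not the paper's implementation:

```python
# EWC adds a quadratic penalty that anchors each parameter to its
# post-Task-A value, scaled by its (diagonal) Fisher importance and
# by the regularization strength lambda.
def ewc_loss(task_b_loss, params, old_params, fisher, lam):
    penalty = sum(f * (p - p0) ** 2
                  for f, p, p0 in zip(fisher, params, old_params))
    return task_b_loss + (lam / 2.0) * penalty

old = [0.5, -1.2, 0.3]    # parameters after training Task A
new = [0.6, -1.0, 0.3]    # parameters during Task B training
fisher = [2.0, 0.5, 0.0]  # importance of each parameter for Task A

loss = ewc_loss(1.0, new, old, fisher, lam=50)
print(loss)
```

Parameters with high Fisher importance are expensive to move, while unimportant ones (zero Fisher value) remain free to adapt to Task B; the lambda value trades off plasticity against retention.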
The analysis of this table shows that neither model completely avoided catastrophic forgetting (i.e., preserved all the knowledge acquired on Task A after continuing training with images from Task B), as the BWT values remain negative. Nevertheless, the explored incremental learning strategies reduced forgetting, as shown by the increase in BWT values compared with the naive strategy, which serves as a baseline. Moreover, when the incremental strategies were employed, the global test accuracy also improved relative to simple fine-tuning (naive strategy), since they allowed more information about the first task to be retained.
Bearing these results in mind, Figure 4 compares the forgetting (assessed through the BWT metric) achieved by the two models under the various incremental learning strategies.
MobileNetV2 outperformed VGG-16 with respect to forgetting, better preserving the knowledge acquired on the previous task (Task A). This is demonstrated by the higher BWT values obtained with MobileNetV2 for all implemented incremental learning strategies, meaning that catastrophic forgetting was lower with this model.
Furthermore, other conclusions may be drawn from this plot and from Table 4, namely concerning the comparison of the different incremental strategies. For both models, the rehearsal strategies (A-GEM and experience replay) outperformed the employed regularization strategy (EWC) for almost all the considered regularization values and memory sizes. Moreover, in the case of the EWC strategy, the regularization-strength value that provided the best results in terms of global accuracy and forgetting was 50. Concerning the rehearsal strategies, as more examples from the first task were considered (i.e., as the memory size increased), the performance of the models improved. Therefore, for each incremental learning strategy, a more detailed analysis was carried out for the regularization value (in the case of the EWC) or memory size (in the case of the rehearsal strategies) that led to the best outcomes. Table 5 presents the accuracy results obtained across the different tasks when the incremental strategies were implemented with these parameters.
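The memory-size effect discussed above follows naturally from how rehearsal works: stored Task A examples are mixed into every Task B batch, so a larger memory carries more of the old distribution forward. A minimal sketch, with illustrative names and sizes rather than the paper's implementation:

```python
import random

def build_memory(task_a_examples, memory_size, seed=0):
    # Keep a random subset of Task A examples for later rehearsal.
    random.seed(seed)
    return random.sample(task_a_examples, memory_size)

def replay_batches(task_b_examples, memory, batch_size=8, replay_per_batch=2):
    random.shuffle(task_b_examples)
    for start in range(0, len(task_b_examples), batch_size):
        batch = task_b_examples[start:start + batch_size]
        # Rehearse a few stored Task A examples alongside the new ones;
        # this is what reduces forgetting at the cost of longer epochs.
        batch += random.sample(memory, replay_per_batch)
        yield batch

task_a = [("imgA%d" % i, "modality") for i in range(500)]
task_b = [("imgB%d" % i, "modality") for i in range(100)]
memory = build_memory(task_a, memory_size=100)
first = next(replay_batches(task_b, memory))
print(len(first))  # 8 new Task B examples + 2 replayed Task A examples = 10
```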
These results are in line with what was previously mentioned: on the one hand, for all strategies, the performance of both models on the first task decreased after incremental training with the Task B images, as a result of catastrophic forgetting; on the other hand, the results on Task B improved compared with those obtained right after training on Task A, since the incremental training allowed the models to learn features of the incremental images.
As previously mentioned, the FWT metric was also computed. The VGG-16 model achieved an FWT value of 0.7303, while the MobileNetV2 achieved 0.6910, which means that, after being trained with images from Task A only, the VGG-16 model performed better on Task B.
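Both transfer metrics can be computed directly from the matrix of task accuracies, following the definitions in [35]. The numbers below are illustrative, not the paper's results:

```python
# R[i][j] = test accuracy on task j after training through task i.

def bwt(R):
    # Backward transfer: accuracy change on earlier tasks after later
    # training; negative values quantify catastrophic forgetting.
    T = len(R)
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)

def fwt(R, b):
    # Forward transfer: zero-shot accuracy on future tasks, relative to
    # a random-initialization baseline b[i] for each task i.
    T = len(R)
    return sum(R[i - 1][i] - b[i] for i in range(1, T)) / (T - 1)

R = [[0.90, 0.70],   # after Task A: good on A, some transfer to B
     [0.80, 0.88]]   # after Task B: A dropped (forgetting), B learned
b = [0.25, 0.25]     # chance-level baseline, assuming four classes

print(round(bwt(R), 2))     # -0.1 -> negative BWT = forgetting
print(round(fwt(R, b), 2))  # 0.45 -> positive FWT = helpful pre-training
```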
Although accuracy is a standard metric for evaluating incremental learning approaches, other metrics were also computed to assess the per-class performance of the models after incremental training. Table 6 presents the F1-score results achieved on the Task A test images after the models were trained on Task B, averaged over the ten iterations that were performed. In addition, Figure 5 and Figure 6 show the confusion matrices of the Task A test images for a randomly selected iteration, allowing an understanding of which modalities were most affected by the incremental training. Although the presented matrices refer to a single iteration, they were plotted for all iterations to avoid biased conclusions.
Comparing these results with those achieved after the first training (Table 2), the anatomic modality underwent the most changes after the models were incrementally trained with images from Task B, as shown by the steeper decrease in its F1-score. Moreover, the confusion matrices in Figure 5 and Figure 6 show that these images were essentially misclassified as full-body or macroscopic images.
Some examples of Task A images that were correctly classified after the first training but misclassified after the incremental one can be found in Figure 7. Looking at these images confirms that they share patterns with the images considered in the incremental task (Figure 2), despite belonging to different modalities. This may explain the change in their classification: for instance, the anatomic modality of Task B comprised images of hands and feet, whereas in the first task such images were assigned to the macroscopic class, as is the case for the rightmost images in Figure 7. As mentioned above, this may result from a labeling issue, since some similar images were assigned to distinct classes and, when the dataset was divided, may have been allocated to different tasks.
Besides the naive strategy, a cumulative strategy was also explored as a baseline. As previously introduced, this strategy consists of retraining the model on all examples from previous and new tasks. Hence, in the context of this problem, the VGG-16 and MobileNetV2 models were trained on images from Tasks A and B together. The models were also tested on the images of Task A, of Task B, and of the global test set (Task A + B). The accuracy results obtained with this approach are presented in Table 7.
Comparing the results achieved with the cumulative strategy and with the incremental learning strategies (Table 7 and Table 5, respectively), the performance on Task A improved when the models had access to all images (A + B) at the same time instead of being trained incrementally. This is demonstrated by the higher Task A accuracy reached with the cumulative strategy compared with the values obtained with the incremental learning approaches. A possible explanation lies in the difference in the number of images belonging to the two tasks: since the first task contains more images, when the models were trained with all images together they were not as affected by the Task B images as during incremental training. Regarding Task B and the VGG-16 model, when all images were available (cumulative strategy), the performance on this task decreased relative to incremental training, except for experience replay with a memory buffer of 500 examples from the first task, which achieved lower results in the incremental scenario. This exception may arise because, when more Task A images were considered in the incremental training, their higher representativeness prevented the incremental model from fitting Task B as well. The same reason may explain the lower Task B performance in the cumulative scenario: due to the smaller number of images belonging to the incremental task, when the model was trained with all images at the same time, it could not adjust properly to this task. Nevertheless, for the MobileNetV2 model, the Task B performance was generally better when the model was trained with all images together. This is consistent with the earlier observation that the forgetting of MobileNetV2 was lower: when incrementally trained, the model did not fit the Task B images as closely, whereas when all images were trained together it adjusted to the global domain. Moreover, comparing the overall outcomes of the models under the incremental learning strategies (Table 4) and in the cumulative scenario (Table 7), performance improved when the models were trained from the beginning with images of both tasks together. However, this scenario requires all images to be available at training time, which may be unfeasible in terms of the memory needed to store all examples or the computational cost of training on them. Bearing this in mind, training models incrementally may be preferable to retraining them as new images become available, offering a better trade-off between performance and the costs of storage and computation.
The efficiency of the incremental learning strategies was assessed in terms of both the time taken per epoch during the learning phase and the RAM required to train the models. Although this evaluation was performed for all strategies, Table 8 presents only the results for the parameters that led to the best performance of each strategy. Furthermore, Figure 8 provides a visual comparison of these results in terms of global test accuracy, training time, and RAM.
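The two efficiency quantities above can be measured with standard operating-system facilities; a minimal sketch (with a placeholder workload standing in for an actual training epoch, on a Unix-like system) might look like:

```python
import time
import resource

def train_epoch():
    # Placeholder workload standing in for one training epoch.
    return sum(i * i for i in range(200_000))

# Wall-clock time per epoch.
start = time.perf_counter()
train_epoch()
epoch_seconds = time.perf_counter() - start

# Peak resident memory of the process: reported in kilobytes on Linux
# and in bytes on macOS.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print(f"epoch time: {epoch_seconds:.3f} s, peak RSS: {peak}")
```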
These values show that each epoch of the rehearsal strategies took longer to train, which results from the larger number of examples considered, as some information from the first task is trained together with the incremental one. Since, for the VGG-16 model, the experience replay strategy using 500 examples from Task A was trained for 20 epochs, it was the strategy in the table that took the longest to train.
Regarding the required RAM, the A-GEM strategy involved a higher computational cost than all other strategies, and was even unfeasible to train with a memory size greater than 150. Therefore, comparing the two rehearsal strategies, although for the VGG-16 model experience replay took longer to train due to the larger number of epochs, in terms of required RAM, and taking into account the obtained accuracy and forgetting results, the experience replay strategy may be advantageous over A-GEM.
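A-GEM's extra memory cost stems from its gradient-projection step, which requires computing and holding a reference gradient over the stored Task A examples alongside the usual one. A toy sketch of that projection, with plain lists standing in for flattened parameter gradients:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def agem_project(g, g_ref):
    # If the Task B gradient would increase the loss on the memory batch
    # (negative inner product with the reference gradient), remove the
    # conflicting component; otherwise keep the gradient as-is.
    inner = dot(g, g_ref)
    if inner >= 0:
        return list(g)
    coef = inner / dot(g_ref, g_ref)
    return [gi - coef * ri for gi, ri in zip(g, g_ref)]

g = [1.0, -2.0]      # gradient on the current Task B batch
g_ref = [1.0, 1.0]   # gradient on the replayed Task A memory batch
projected = agem_project(g, g_ref)
print(projected)              # [1.5, -1.5]
print(dot(projected, g_ref))  # 0.0 -> no longer conflicts with memory
```

Experience replay, by contrast, simply mixes stored examples into the batch and needs no second gradient, which is consistent with its lower RAM footprint observed above.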
Moreover, although the rehearsal strategies performed better than the EWC, they require some previous images to be kept in memory for later use in combination with the incremental set. Thus, if these images are not available, or if training time is a limiting factor, and considering that the regularization strategy also achieved promising results, it may be preferable to A-GEM or experience replay in the context of this problem.