1. Introduction
Because of the growth of image analysis-based applications, researchers have adopted deep learning to develop intelligent systems that perform learning tasks in computer vision, image processing, text recognition, and other signal processing problems. Deep learning architectures are generally a good solution for learning on large-scale data, surpassing classic models that were once the state of the art in multimedia problems [1].
Unlike classic approaches to pattern recognition tasks, convolutional neural networks (CNNs), a type of deep learning, can extract features and, at the same time, recognize these features. CNNs can process data that are stored as multi-dimensional arrays (1D, 2D, and so on). They extract meaningful abstract representations from raw data [1], such as images, audio, text, video, and so on. CNNs have also received attention in the last decade due to their success in fields such as image classification [2], object detection [3], semantic segmentation [4], and medical applications that support a diagnosis by signals or images [5].
Despite their benefits, CNNs also suffer from some challenges: they incur a high computational cost, which has a direct impact on training and inference times. Classification time is an issue for real-time applications that tolerate a minimal loss of information. Another challenge is the long training and testing times if we consider a computer with limited hardware resources. Local minima, intensive human intervention, and vanishing gradients are other problems [6]. Therefore, it is necessary to investigate alternative approaches that may extract deep feature representation and, at the same time, reduce the computational cost.
The extreme learning machine (ELM) is a type of single-layer feed-forward neural network (SLFN) [7] that provides a faster-converging training process and does not require a series of iterations to adjust the weights of the hidden layers. According to [8], it “seems that ELM performs better than other conventional learning algorithms in applications with higher noise”, presenting similar or better generalization in regression and classification tasks. Unlike others, an ELM model executes a single hidden layer of neurons with random feature mapping, providing a faster learning execution. The low computational complexity has attracted a great deal of attention from the research community, especially for high-dimensional and large data applications [9].
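To make the training procedure concrete, the following minimal sketch (Python/NumPy; the sigmoid activation, hidden-layer size, and ridge constant C are illustrative assumptions, since the surveyed papers vary on these choices) shows the closed-form ELM training step: random input weights and biases, a single hidden-layer projection, and output weights obtained by regularized least-squares.

```python
import numpy as np

def elm_train(X, T, n_hidden=1000, C=1e3, seed=0):
    """Train a basic ELM: random hidden layer + least-squares output weights.

    X: (n_samples, n_features) inputs; T: (n_samples, n_classes) one-hot targets.
    n_hidden and the ridge constant C are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))   # random input weights (never trained)
    b = rng.standard_normal(n_hidden)                  # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))             # hidden-layer activations (sigmoid)
    # Regularized least-squares: beta = (H'H + I/C)^(-1) H'T
    beta = np.linalg.solve(H.T @ H + np.eye(n_hidden) / C, H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                                    # class scores; argmax gives the label
```

In contrast to backpropagation, the only learned parameters are the output weights, which is what makes the training non-iterative and fast.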
A new neural network paradigm was proposed based on the strengths of CNNs and ELMs: the convolutional extreme learning machine (CELM) [10]. CELMs are quick-training CNNs that avoid gradient calculations to update the network weights. Filters are efficiently defined for the feature extraction step, and least-squares is used to obtain the weights of the classification stage’s output layer through an ELM network architecture. In most cases, the accuracy achieved by CELMs is not the best of all approaches [10]; however, the results are very competitive when compared to those obtained by convolutional networks, in terms of not only accuracy, but also training and inference time.
Some work in the literature has presented a survey of the ELM from different perspectives. Huang et al. [11] presented a survey of the ELM and its variants. They focused on describing the fundamental design principles and learning theories. The main ELM variants presented by the authors are: (i) the batch learning mode of the ELM, (ii) fully complex ELM, (iii) online sequential ELM, (iv) incremental ELM, and (v) ensemble of ELM. Cao et al. [8] presented a survey on the ELM while mainly considering high-dimensional and large-data applications. The work in the literature can be classified into image processing, video processing, and medical signal processing. Huang et al. [12] presented trends in the ELM, including ensembles, semi-supervised learning, imbalanced data, and applications such as computer vision and image processing. Salaken et al. [13] explored the ELM in conjunction with transfer learning algorithms. Zhang et al. [14] presented current approaches based on the multilayer ELM (ML-ELM) and its variants as compared to classical deep learning.
However, despite the existence of some ELM surveys, none of them have specifically focused on the CELM. Therefore, in contrast to the existing literature, we present a systematic review that concentrates on the CELM applied in the context of (i) the usage of deep feature representation through convolution operations and (ii) the usage of ELM with the aim of achieving fast feature learning in/after the convolution stage. We discuss the proposed architectures, the application scenarios, the benchmark datasets, the principal results and advantages, and the open challenges in the CELM field.
The rest of this work is organized as follows: Section 2 presents the methodology adopted to conduct this systematic review. An overview of the primary studies of this systematic review is presented in Section 3. Sections 4–7 present the answers to each research question defined in the systematic review protocol. Finally, we conclude this work in Section 8.
2. Methodology
We adopted the methodology previously used by Endo et al. [15] and Coutinho et al. [16] to perform the systematic review. The mentioned systematic review protocol was originally inspired by the classic protocol proposed by Kitchenham [17].
Figure 1 illustrates the methodology that was adopted in this work. Next, we explain each of these steps.
Identify the need for a review: because of the growth of image and big data applications, both academia and industry use deep learning to analyze data and extract relevant information. For large networks, deep learning architectures suffer from drawbacks such as high computational cost, slow convergence, vanishing gradients, and hardware limitations during training.
In this systematic review, we mainly investigate the use of the CELM as a viable alternative to deep learning architectures, one that guarantees quick training and avoids the need for gradient calculations to update the network’s weights. In recent years, CELMs have addressed some of the leading deep learning issues while maintaining a reasonable quality of solutions in many applications.
Define research questions: we begin our work with the definition of four research questions (RQ) that are related to our topic of study. Our objective is to answer these research questions to raise a discussion of the current state of the art in the usage of the CELM in the specific domain of image analysis. The research questions are as follows:
RQ 1: What are the most common problems based on image analysis and datasets analyzed in the context of the CELM?
RQ 2: How are the CELM architectures defined in the analyzed work?
RQ 3: Which are the main findings when applying the CELM to problems based on image analysis?
RQ 4: What are the main open challenges in applying the CELM to problems based on image analysis?
Define search string: to find articles related to our RQs, it was necessary to define a suitable search string to be used in the adopted search sources. To create such a search string, we defined terms and synonyms related to the scope of this research. The search string that was defined was "(("ELM" OR "extreme learning machine" OR "extreme learning machines") AND ("image recognition" OR "image classification" OR "object recognition" OR "object classification" OR "image segmentation"))".
We considered four primary databases (IEEE Xplore, Springer Link, ACM DL, and Science Direct) and two meta-databases (SCOPUS and Web of Science). We first selected the articles from the primary databases because the meta-databases returned some duplicate results.
Define criteria for inclusion and exclusion: we defined criteria for the inclusion and exclusion of articles in this systematic review with the aim of obtaining only articles within the scope of this research. The criteria were as follows:
primary studies published in peer-reviewed journals or conferences (congress, symposium, workshop, etc.);
work that answers one or more of the RQs defined in this systematic review;
work published from 2010 to 2020;
work published in English; and,
work accessible or freely available (using a university proxy) from the search sources used in this project.
Identify primary studies: we identified the primary studies according to the inclusion and exclusion criteria.
Extract relevant information: we extracted relevant information from the primary studies by reading the entire paper and answering the RQs.
Present an overview of the studies: in this step, we present a general summary of the primary studies that were selected in the systematic review. The overview information includes the distribution of the articles by year of publication and the databases from which they were obtained. Section 3 presents the overview of the studies.
Present the results of the research questions: considering the research questions, we present the answers found from the analysis of the selected articles. The answers to the defined research questions are the main contribution of this systematic review. Sections 4–7 present the results of this step.
4. Common Problems and Datasets
From the primary studies, the main machine learning problems for multimedia analysis can be divided into two main groups: image classification and semantic segmentation.
Eighty studies were found to be related to image classification. Image classification is the process of labeling images according to the information present in these images [2], and it is performed by recognizing patterns. The classification process usually analyzes an image and associates it with a label describing an object. Image classification may be performed through manual feature extraction and classical machine learning algorithms or through deep learning architectures, which learn patterns in the feature extraction process.
Only one study [23] covered semantic segmentation. Semantic segmentation in images consists of categorizing each pixel present in the image [4]. The learning models are trained from ground truth information, i.e., annotations indicating the category of each pixel of the input image. This model type’s output is the segmented image, with each pixel adequately assigned to an object.
The relative simplicity of implementing CELM models for image classification may explain the large difference in the number of studies between the two tasks. For the image classification task, the architectures stack convolutional layers, pooling, and ELM concepts sequentially (see more details in RQ 2). This facilitates the implementation of CELM models.
Models for semantic segmentation need other concepts to be effective. In semantic segmentation, it is necessary to make predictions at the pixel level, which requires the convolution and deconvolution steps to reconstruct the output images. These concepts may be targeted by researchers in the future.
Note that object detection is also a common problem in the computer vision field, but we did not find studies solving object detection using CELM concepts in this systematic review.
We found 19 different scenarios from the primary studies, but most of them contained four or fewer related studies. Thus, we highlight the six main application scenarios, and the others are grouped into a single category (Others), as shown in Figure 3. The six main CELM application scenarios found among the primary studies were object recognition, remote sensing, medicine, handwritten digit recognition, RGB-D image recognition, and face recognition, totaling 69 articles, or about 83% of the total primary studies.
4.1. Object Recognition
Object recognition is one of the most common problems addressed in the primary studies found in this systematic review. Object recognition consists of classifying objects in scenes and is not a trivial task. Generally, the dataset for an object recognition problem comprises several elements divided by classes, with variation in object positions, lighting conditions, and so on. We found 18 studies dealing with object recognition—approximately 21% of the primary studies.
Among the primary studies, we found nine different object recognition datasets: NORB, CIFAR-10, CIFAR-100, Sun-397, COIL, ETH-80, Caltech, GERMS, and DR. These datasets, in general, have a wide range of classes, which challenges the generalization ability of machine learning models. Therefore, if a proposed model obtains expressive results on large datasets for object recognition, there is strong evidence that the model has a good generalization capacity.
Table 2 shows the reported datasets for object recognition and their respective references. Most of the studies use the NORB and CIFAR-10 datasets (representing more than 50% of usage). Note that some studies use more than one dataset for training and testing their models. Next, we present a brief description of the main datasets that are found for object recognition.
The NYU Object Recognition Benchmark (NORB) dataset [40] is composed of 194,400 images that are pairs of stereo images from five generic categories under different angles, light setups, and poses. The variations consist of 36 azimuths, nine elevations, and six light setups.
The Canadian Institute For Advanced Research (CIFAR-10) [41] is a dataset that contains 60,000 tiny images of 32 × 32 pixels divided into 10 classes of objects, with 6000 images per class. Additionally, CIFAR-100 contains 100 classes of objects, with 600 images per class. The default split protocol is 50,000 images for the training set and 10,000 for the test set.
The Columbia University Image Library (COIL) [42] is an object image dataset. There are two main variations: COIL-20, a dataset that contains 20 different classes of grayscale images; and COIL-100, which contains 100 classes of colored images. A total of 7200 images compose the COIL-100 dataset.
Caltech-101 [43] is a dataset that contains 101 categories. There are 40 to 800 images per class, and most of the categories contain about 50 images of roughly 300 × 200 pixels. A later version of the dataset, Caltech-256, contains 256 categories and 30,607 images.
ETH-80 [44] is a dataset composed of eight different classes. Each class contains 10 object instances, and 41 images comprise each instance. There are a total of 3280 images in the dataset.
6. Main Findings in CELM
Based on the scenarios and the most common datasets used in the primary studies, in this section we describe the main findings when applying the CELM to image analysis.
Next, we present the accuracy results of the CELM models on the main datasets presented in Section 4. We also present the time required for training and testing the CELM models. It is worth mentioning that these time results illustrate that CELM models are trained and tested in less time than classic machine learning models; they should not be compared directly against each other, as each model was trained and tested on different machines with different setups.
The authors in [6,24,25,26,27,28,29,30] applied the CELM to solve the object recognition problem using the NORB dataset. In general, all studies presented good accuracy, with all of them achieving over 94%, as shown in Table 13. All of these studies performed comparisons against algorithms such as the classic CNN, MLP, and SVM, and the CELM models outperformed all of the studied approaches.
The best accuracy results were 98.53% and 98.28%, achieved by [28] using an ELM-LRF with autoencoding receptive fields (ELM-ARF) and by [30] using ELMAENet, respectively. This demonstrates the excellent representativeness of the extracted features and the generalization capability of ELM models.
We noted that some of the studies only presented the training time for the NORB dataset in their experiments. For this reason, in Table 13, we do not consider the testing time in the discussion. The best training time was achieved by [27] (216 s), probably due to the compact autoencoding of features by the ELM-AE. The worst result was achieved by [29] (4401.07 s). The difference is probably due to the different machine and scenario setups, as previously discussed. In general, ELM-LRF-based architectures provide a low training time due to the simplicity of the architectures. All of these architectures presented better training times than classic machine learning models.
The authors in [46,47,48,49,50,51,54] used the Pavia dataset for remote sensing classification. Note that remote sensing approaches use other evaluation metrics, such as average accuracy (AA), overall accuracy (OA), and Kappa, as shown in Table 14.
The most common approach used for this purpose is a CNN previously pre-trained on the Pavia dataset for feature extraction and an ELM for the classification task [46,49,50]. However, the ELM-HLRF that was proposed in [51] achieved the best AA and OA results, at 98.25% and 98.36%, respectively.
Most of the studies did not report any results regarding training or testing time, but the reported values illustrate the efficiency of these models for remote sensing classification. The work in [46] reported a low training time of 14.10 s, and the work in [47] achieved a testing time of 0.79 s.
Table 15 presents the accuracy results using the MNIST dataset for handwritten digit recognition, as performed by [24,27,28,29,30,34,74,75,76,77]. All of the studies presented a high accuracy of over 96%. The training time varied considerably, ranging from 8.22 s [76] to 2658.36 s [29]. Regarding the testing time, the work performed in [76] also presented the best performance (0.89 s).
Differences in the neural network implementations can make a difference in processing time, which may explain the gap between the work performed in [76] and the others. Although it had the best training and testing times, the work presented in [76] achieved the worst accuracy for the handwritten digit classification task (96.80%).
We highlight the work presented in [30], which outperformed the other models in terms of accuracy (99.46%) using ELMAENet. The results showed that the feature representations of the ELM-LRF and of the CNN with ELM-AE were sufficient to reach a good accuracy; in both cases, the accuracy was superior to 99%. The results obtained by the studies demonstrate that CELM approaches have good generalization performance on this benchmark dataset.
Table 16 shows the results related to the YALE dataset for face recognition obtained by the studies [26,28,81]. All of the studies reported an accuracy superior to 95%. The best accuracy result was found by [81] (98.67%), and the worst was achieved by [26] (95.56%).
The accuracy results obtained by [81] (PCA convolution filters) and [28] (multiple autoencoding ELM-LRF) suggest that the use of multiple random or Gabor filters alone was not sufficient to provide a good representation of the data for training an ELM on the YALE dataset. The studies [28,81] have more robust architectures, which can explain their better accuracy results.
Only the work in [28] presented training and testing times, at 16 and 0.38 s, respectively. The literature suggests that CELM approaches can also reach good accuracy results for the face recognition problem. On the other hand, conclusions about training and testing time remain unclear due to the missing reported results.
Table 17 shows the results when solving RGB-D image recognition using the Washington RGB-D Object dataset in the studies [90,91,92,93,94]. RGB-D image recognition is a task that considers two types of data, the RGB color channels and the depth channel, which makes the classification task more difficult. The accuracy results varied from 70.08% (single ELM-LRF) to 91.10% (VGGNet-ELM). One can note improvements when the RGB and D channels are separately processed in random filter representations [91,92,93]. There is no significant difference between the results reached by feature extraction with random convolutional architectures (up to 90.80% [92]) and with pre-trained architectures (91.10% [94]). Despite the high complexity of RGB-D classification, the CELM architectures reached good accuracy. Although it provided a low accuracy, the work presented in [90] achieved the best training time (192.51 s) owing to its simpler network.
In general, one can note that the CELM models provide satisfactory results in terms of accuracy and computational performance (training and testing time).
CNN-based approaches with predefined kernels for feature extraction provide good results in terms of accuracy and training time. In two scenarios (object and face recognition), architectures of this type presented better accuracy [27,81] than other approaches, such as the deep belief network and stacked autoencoders. The excellent performance of this approach in the computational aspect is due to its one-way training style. Feature extraction is the most costly stage due to the high number of matrix operations in the CNN. However, when it comes to the training stage using the ELM, the processing time is not an aggravating factor, except when the architectures’ complexity is increased.
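To illustrate this one-way training style, the sketch below (NumPy/SciPy; the filter count, kernel size, and pooling size are illustrative assumptions rather than values taken from any surveyed architecture) builds a bank of fixed random convolution filters, applies mean pooling, and produces a feature matrix that can be passed directly to the ELM training step from the earlier sketch.

```python
import numpy as np
from scipy.signal import convolve2d

def random_conv_features(images, n_filters=8, k=5, pool=4, seed=0):
    """Fixed (untrained) random convolution filters followed by mean pooling.

    images: (n, H, W) array of grayscale images. The filter count, kernel size,
    and pooling size are illustrative choices, not values from the surveyed papers.
    """
    rng = np.random.default_rng(seed)
    filters = rng.standard_normal((n_filters, k, k))
    feats = []
    for img in images:
        maps = []
        for f in filters:
            m = convolve2d(img, f, mode="valid")                 # one feature map per filter
            h, w = (m.shape[0] // pool) * pool, (m.shape[1] // pool) * pool
            pooled = m[:h, :w].reshape(h // pool, pool, w // pool, pool).mean(axis=(1, 3))
            maps.append(pooled.ravel())
        feats.append(np.concatenate(maps))
    return np.asarray(feats)                                     # (n, n_filters * pooled_dims)

# The pooled features are then classified by an ELM, e.g. (reusing the earlier sketch):
# W, b, beta = elm_train(random_conv_features(train_images), T_onehot)
```

Because the filters are never trained, the only learning step is the single least-squares solve in the ELM, which is what keeps the overall training time low.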
Regarding the approaches that use pre-trained CNN architectures (in the same or another domain) to extract features and later perform fine-tuning with an ELM, the results are also satisfactory. This approach outperforms others in the remote sensing and RGB-D image recognition scenarios [49,94] when considering the accuracy metric. Classic CNNs and support vector machines are examples of outperformed approaches. This approach’s training method is also a one-way training style, which explains the excellent training time involved in the learning process.
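A hedged sketch of this second family of architectures is given below: a frozen, ImageNet-pre-trained VGG16 backbone (assuming torchvision >= 0.13, where the `VGG16_Weights` API is available) replaces the random filters as the feature extractor, and its fully connected head is discarded so that an ELM can be trained on the pooled convolutional features. The choice of VGG16 mirrors the VGGNet-ELM approach mentioned above, but any pre-trained backbone could be substituted.

```python
import torch
from torchvision import models

def pretrained_cnn_features(pil_images):
    """Extract deep features with a frozen, ImageNet-pre-trained VGG16.

    pil_images: list of PIL images. Assumes torchvision >= 0.13 (weights API).
    """
    weights = models.VGG16_Weights.IMAGENET1K_V1
    backbone = models.vgg16(weights=weights)
    backbone.classifier = torch.nn.Identity()        # drop the fully connected head
    backbone.eval()
    preprocess = weights.transforms()                # matching resize/crop/normalization
    with torch.no_grad():
        batch = torch.stack([preprocess(img) for img in pil_images])
        feats = backbone(batch)                      # pooled conv features, shape (n, 25088)
    return feats.numpy()

# These features are then fed to an ELM (see the earlier sketch), which is trained
# in a single least-squares step instead of fine-tuning the backbone with backpropagation.
```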
The fast-training approaches for CNN models using ELM concepts could not be further analyzed, because only a few studies were found in the literature. However, one can note that this approach outperforms other CELM models, such as ELM-LRF and PCANet-ELM, in terms of accuracy when considering the handwritten digit recognition problem [30]. Instead of using the backpropagation algorithm for feature training, the authors used the ELM-AE network, obtaining a more compact representation of the data and a better training time.
In general, the CELM presented interesting results regarding accuracy as compared to several proposals found in the literature. Despite not having the same power as conventional CNNs (with fully connected layers and backpropagation) to extract features, the CELM’s accuracy proved competitive in the analyzed scenarios and benchmark datasets. The competitiveness of the results is clear when, in many cases, the CELM was superior to several traditional models, such as the MLP (as in [50,69,102]) and the SVM (as in [46,90,113]). Observing these results, we reported good generalization and good representativeness for the CELM [27,49,55,57,68,97,104].
From the primary studies, we also notice that CELM architectures have good convergence and provide good accuracy. Replacing the fully connected layers with an ELM network increases the training speed and avoids fine adjustments [30,115,122,131]. Convergence is achieved without iterations and intensive updating of the network parameters. In the case of a CNN for feature extraction with an ELM, the training is done by the ELM after the extraction of the CNN features. Rapid training reflects directly on computational performance. With the adoption of the CELM, it is possible to decrease the processing time required for the learning process. This feature makes the CELM able to solve problems at a large scale, such as real-time or big data applications [29,94,111].
7. Open Challenges
Despite the many advantages of the CELM architectures, such as suitable training time, testing time, and accuracy, some open challenges remain and can serve as inspiration for future research, contributing to the advancement of the CELM research field.
It is known that the number of layers can be an important factor in the generalization ability of a neural network. Classic studies of deep learning proposed architectures with multiple convolution layers [18,19,129]. However, when the number of layers is increased, problems with increasing training time and a loss of generalization capacity emerge [77], which can cause overfitting issues. These two reasons may explain why many CELM architectures with predefined kernels do not use very complex architectures to extract features.
Despite the good performance of GPUs, sometimes it is not possible to use them in a real environment. When this happens, all of the data are stored sequentially in RAM and processed by the CPU, increasing the training time, especially when handling data with high dimensionality. One possible way to overcome this issue is to use approaches that aim at high-performance computing through parallel computing. Furthermore, strategies for batching the features can replace the number of samples N with the batch size in the memory requirements [9]. There is an approach in the literature that aims to use an ELM for large-scale data problems, known as the high-performance extreme learning machine [9], which could be adequately analyzed in the context of the CELM.
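A minimal sketch of this batching idea, along the lines of the high-performance ELM [9], is shown below (NumPy; the tanh activation and ridge constant are assumptions): the normal-equation matrices H^T H and H^T T are accumulated batch by batch, so memory usage depends on the number of hidden neurons rather than on N.

```python
import numpy as np

def batched_elm_fit(batches, W, b, C=1e3):
    """Accumulate H'H and H'T over mini-batches so memory does not grow with N.

    batches: iterable of (X_batch, T_batch) pairs; W, b: fixed random hidden
    weights/biases. The tanh activation and ridge constant C are assumptions.
    """
    n_hidden = W.shape[1]
    HtH = np.zeros((n_hidden, n_hidden))
    HtT = None
    for X, T in batches:
        H = np.tanh(X @ W + b)                                   # activations for this batch only
        HtH += H.T @ H
        HtT = H.T @ T if HtT is None else HtT + H.T @ T
    beta = np.linalg.solve(HtH + np.eye(n_hidden) / C, HtT)      # output weights, as in the basic ELM
    return beta
```

The solution is identical to the full-batch least-squares result, but at any moment only one batch of activations and two hidden-sized matrices are kept in memory.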
Regarding the problem of the number of convolutional layers, a gradual increase in the complexity of the network can cause problems in the model’s generalization. This can decrease the accuracy and cause overfitting. Some studies in the literature have proposed new structures that increase the number of layers without a loss in the generalization of the network and improve the accuracy results, such as residual blocks [19] and dense blocks [129]. This is another research challenge that can be considered in CELM architectures: increasing the number of layers to increase the accuracy without losing the network’s generalization capacity. These deep convolutional approaches could inspire new CNN architectures for the CELM.
There is also a research field that aims to make deep learning models more compact, which accelerates the learning process. Traditional CNN models generally demand a high computational cost, and by compressing these models it is possible to make them lighter. Two well-known techniques for CNN compression are pruning and weight quantization. Pruning removes a subset of parameters (filters, layers, or weights) evaluated as less critical for the task. None of the studies included in this systematic review reported the use of pruning or weight quantization. Approaches for pruning or weight quantization (or a combination of both) could improve the learning process of CELMs, removing irrelevant information from the neural network and optimizing support for real-time applications.
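As a purely hypothetical illustration of how pruning could be applied to a CELM’s fixed convolution stage, the sketch below ranks filters by their average absolute response on a validation batch and keeps only the strongest ones; the criterion and the keep ratio are assumptions, not techniques reported in the surveyed studies.

```python
import numpy as np

def prune_filters_by_response(feature_maps, filters, keep_ratio=0.5):
    """Magnitude-based pruning of a fixed convolution filter bank (hypothetical).

    feature_maps: (n_samples, n_filters, h, w) responses on a validation batch;
    filters: (n_filters, k, k) bank. Filters with the weakest average absolute
    response are discarded; the criterion and keep_ratio are assumptions.
    """
    scores = np.abs(feature_maps).mean(axis=(0, 2, 3))       # one score per filter
    n_keep = max(1, int(keep_ratio * len(filters)))
    keep = np.sort(np.argsort(scores)[::-1][:n_keep])        # indices of the strongest filters
    return filters[keep], keep
```

Pruning the filter bank in this way would shrink the feature matrix fed to the ELM, reducing both the convolution cost and the size of the least-squares problem.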
In this systematic review, we did not find any work on object detection problems. Deep learning architectures for object detection, such as R-CNN, Mask R-CNN, and YOLO, could inspire new CELM studies. Such architectures have high computational costs, and when object detection deep learning models are run on a CPU, there is a loss in computational performance. Developing new architectures for object detection using ELM concepts could help applications where computational resources are limited.
Another common computer vision problem that recurs in the literature but is little addressed in this systematic review is semantic segmentation. The difficulty may be linked to image reconstruction and decoding operations through deconvolutions, which are usually trained with the backpropagation algorithm. This is another open challenge for the CELM: ELM networks could replace backpropagation in the calculation of the weight updates of both convolutional and deconvolutional layers for the reconstruction of segmented images.
Although the CELM has presented promising and interesting results in RGB-D classification and remote sensing tasks, there is still a lack of CELM studies in these areas. There are no studies to date that prove the strength of the CELM on very large datasets or on even more complex tasks. Therefore, there is a need for a performance evaluation (accuracy and computation) of CELM models on large state-of-the-art datasets, such as ImageNet, the COCO dataset, and Pascal VOC. These three databases are current references in deep learning for image classification, object detection, and semantic segmentation, in addition to other problems such as human pose estimation and panoptic segmentation. Performing new experiments on state-of-the-art deep learning datasets would strengthen all aspects of the CELM’s advantages covered in this systematic review.
8. Conclusions
CELMs are quick-training CNNs that avoid the use of backpropagation calculations for updating the network weights. Filters are efficiently defined for the feature extraction step, and least-squares is used to obtain the weights of the classification stage’s output layer through an ELM network.
We presented a systematic review on the CELM in image analysis while considering the literature published over the last 10 years. Initially, we collected 2220 articles, and after removing duplicates and applying the inclusion and exclusion criteria, we analyzed 81 studies on the CELM. We reported 19 different scenarios, and object recognition was the most common application in which the CELM was used. Additionally, we found and classified the studies into four different types of CELM architectures: (i) a CNN with predefined kernels for feature extraction and an ELM for fast training; (ii) a CNN pre-trained in another application domain for feature extraction and an ELM for fast training; (iii) a CNN pre-trained in the same application domain for feature extraction and an ELM for fast training; and (iv) the fast training of CNNs using ELM concepts. The CNN with predefined kernels was the most common architecture proposed in the literature, followed by the CNN pre-trained in the same application domain.
Analyzing the primary studies, we can state that CELM models provide good accuracy and good computational performance. We highlight the excellent feature representation that is achieved by the CELM, which can explain its good accuracy results. In general, the CELM architectures present fast convergence by changing the conventional fully connected layers to the ELM network. This change avoids fine adjustments by the backpropagation algorithm’s iterations. Finally, there is a decrease in the total processing time that is required for the learning process when using CELM architectures, making it suitable to solve image analysis problems in real-time applications.
As limitations of this work, we could initially cite the time range of the systematic review: it was focused on the last 10 years of the literature, but, as presented, the first papers regarding the CELM were published in 2015, and most of them after 2018. Moreover, as time passes and the CELM becomes a more active research area, new studies are constantly being published, and keeping the systematic review up-to-date is a difficult task.
In terms of future research that is based on the findings and open challenges of this systematic review, we highlight the following directions: (i) strategies to increase the number of convolutional layers without negative impacts on overfitting, training time, and generalization issues; (ii) optimized implementations of the ELM for high-performance computing in the context of the CELM can address the training time issue; (iii) pruning, compaction, and/or quantization techniques for convolution layers may accelerate the learning process; (iv) the implementation of new CELM architectures inspired by traditional deep learning architectures—e.g., R-CNN, Mask R-CNN, and YOLO—for object detection and image segmentation; and, (v) regarding the reconstruction of segmented images, it would be possible to investigate the use of ELM networks to replace backpropagation in the calculation of the updates of the weights of the deconvolutional/upsampling layers.