You are currently viewing a new version of our website. To view the old version click .
Algorithms
  • Article
  • Open Access

17 October 2018

LSTM Accelerator for Convolutional Object Identification

,
,
and
1
Computer Engineering and Informatics Department, University of Patras, Patras 26504, Greece
2
Department of Informatics, Ionian University, Corfu 49100, Greece
*
Author to whom correspondence should be addressed.
This article belongs to the Special Issue Humanistic Data Mining: Tools and Applications

Abstract

Deep Learning has dramatically advanced the state of the art in vision, speech and many other areas. Recently, numerous deep learning algorithms have been proposed to solve traditional artificial intelligence problems. In this paper, in order to detect the version that can provide the best trade-off in terms of time and accuracy, convolutional networks of various depths have been implemented. Batch normalization is also considered since it acts as a regularizer and achieves the same accuracy with fewer training steps. For maximizing the yield of the complexity by diminishing, as well as minimizing the loss of accuracy, LSTM neural net layers are utilized in the process. The image sequences are proven to be classified by the LSTM in a more accelerated manner, while managing better precision. Concretely, the more complex the CNN, the higher the percentages of exactitude; in addition, but for the high-rank increase in accuracy, the time was significantly decreased, which eventually rendered the trade-off optimal. The average improvement of performance for all models regarding both datasets used amounted to 42 % .

1. Introduction

Nowadays, the quantity of available data keeps increasing on a daily basis, mainly because of the vast production that an average user can provide. This multimedia type of information has made possible new ways of processing data and understanding how to restructure them for our daily needs. However, still, we are not taking full advantage of the meta-information that the data present to us, mainly because of the lack of our ability to pinpoint their inherent visual representation. In order to fulfil this ever-growing need, the creation of systems that can explain a visual piece of information through semantic and algebraic features is deemed highly integral.
In most cases, dividing the procedure of representing and deducting the content of an image can be arranged just as in the work of [1]. In the beginning, the features that are to be produced must consist of a low-level representation, before advancing to more perplexing stages. Most common of these are color-related features, the texture of an image or even its shape. This low-level representation of characteristics can be easily approximated with a single vector representation as a flattened version of the image matrix. In order to reduce their sensitivities to potential information loss and ambiguity, it is attempted to provide them with more context-specific descriptors, since the aforementioned features are further developed. Such attempts were introduced in the works of SIFT [2], SURF [3] and HoG [4]. These newly-found descriptors appear to have a more robust behavior on homography distortions, especially when it comes to a comparison with the global features. To summarize the extracted knowledge of such procedures, the spatial features are rearranged with bag of visual words [5].
Another work heavily based on descriptor extraction was produced by [6]. The L 2 norm was utilized in this work, during the training and testing steps, mainly to create the multi-dimensional feature maps. These descriptors were easily adapted to Siamese networks with non-corresponding patches, thus enabling its utility in every algorithm pertaining to the logic of SIFT.
Studies have recently indicated that there does not exist a single procedure to characterize all datasets with the same features accordingly. However, the pure unprocessed data can often provide the most beneficial routes to extract the desired features of the available data, in contrast to the ability of a human to manually assign labels according to his/her general observations. This human-machine contradiction and comparison is also seen in the work of [7]. Despite the obvious ability of humans to correctly label faces, especially widely-known people that can rarely be misunderstood, their proposed network conglomerate of layers achieved almost identical accuracy, as well as false positive rates along with the manual human labeling. Of course, the time that the algorithm consumed to rank each face image, in comparison to the human labeling, was negligible, as the computer processing the images needed 0 . 33 seconds per image for the complete assignment of a label to a face.
Currently, the breakthrough of deep learning as a subfield of machine learning [8] has introduced a new idea that consists of translating the pixels of a matrix to an applicable form through iterative algorithmic learning. Most algorithms that pertain to the deep learning field pursue a high-level generalization of the available data at hand, through a hierarchical stack of processing layers. Each paralleled layer can produce a milestone output that has concluded a partial analysis on the starting image. Many state-of-the-art deep neural networks have peaked at substantial performances through an abundant layer depth analysis on the corresponding datasets. As the depth of the stack increases, so does the complexity. The two metrics are not always analogous, and in particular situations, a bigger neural network may be deemed detrimental in comparison to a smaller, more compact and, thus, more sufficient in terms of algorithmic time needed.
Taking a step further in the logical continuum of deep learning, the computational power of the current technology can be utilized to a great extent in order to detect the desired feature descriptors for input images. Deep learning algorithms in their nature are able to train themselves through the incoming input data in order to create the high level abstraction that describes the data. The most appropriate kind of architecture for such tasks is comprised by Convolutional Neural Networks (CNNs). The CNNs posses a layer stack that convolves the input image against a number of filters before they present the final result. These filters are an inextricable part of the CNN and contribute immensely to the simple convolution.
The primary contribution of this article consists of the following aspects. Initially, a framework capable of generating hierarchical labels, by integrating the powerful Convolutional Neural Networks (CNN), used to generate discriminative features, is introduced. Secondly, batch normalization is utilized so as to allow the use of much higher learning rates and be less careful about initialization. Specifically, it acts as a regularizer, whereas when applied to a state-of-the-art image classification model, batch normalization achieves the same accuracy with fewer training steps and beats the original model by a significant margin. In the following, the LSTM neural net layers, which frame the CNN schema, are properly introduced. Following the convolution feature extraction procedure, the long short-term memory NNs tackle a variation of a min-max problem; this neural layer attempts to maximize the yield of the complexity tendency to diminish, while simultaneously minimizing the loss of accuracy in the process. The input data feature maps, as generated by the CNNs, are considered a sequential streamline of images. Such image sequences can be classified by the LSTM in a more accelerated manner, as well as accomplish appealing amounts of approximately indistinguishable accuracy.
In initial efforts, such as [9], the authors used more modern CNNs for encoding, and thus, they replaced feed-forward networks with recurrent neural networks [10]. In particular, the LSTM cell has been utilized, as it outperforms other competing methods and also learns to solve complex, artificial tasks that no other recurrent net algorithm has solved [11]. The common theme of these works is that they represented images as the top layer of a large CNN (hence, the name “top-down”, as no individual objects are detected) and produced models that were end-to-end trainable. Furthermore, a multiplicative input gate unit is introduced in order to protect the memory contents stored in a linear unit from perturbation by irrelevant inputs. The resulting, more complex unit is called a memory cell, where its cell is built around a central linear unit with a fixed self-connection [12]. We differentiate our proposed work from the one in [12], as optimization techniques and different metrics for validating the improvement of the proposed architecture are incorporated.
The remainder of the paper is structured as follows: Section 2 overviews related work. Section 3 presents in detail the techniques, modules and sub-modules of our model that have been chosen, while in Section 4, our proposed system architecture is presented, as well as details of the implementation and the evaluation conducted in Section 4. Ultimately, Section 5 presents conclusions and draws directions for future opus that may extend the current version and performance of this work.

3. Material and Methods

In this section, a formal definition of all the layers that constitute our hybrid schema and a brief description of the semantics that lie behind each case are presented.

3.1. Convolutional Layer

The two main parts that constitute a convolutional neural net that is applied as a layer to the input data are the following. The sequence of the input-stream is distributed to a number of d filters, whose size amounts to r, F d × r :
f t = ϕ ( F [ e t - ( r / 2 ) + 1 ; ; e t ; ; e t + ( r / 2 ) ] )
The activation function of our layer is denoted as ϕ . Other options, such as tanh or a rectifier unit, exist, although a non-linear function is chosen. For every part of the sequence of inputs that the function is applied to, a part of the resulting sequence is produced that we denote as F = ( f 1 , f 2 , , f T ) .
The product of the convolutional layer F is then followed by a max-pool layer with a kernel of according to size r , while all the elements of the resulting vectors are processed through m a x , thus producing a scalar product:
f t = m a x ( f ( t - 1 ) × r + 1 , , f t × r )
F = ( f 1 , f 2 , , f T / r )

3.2. ReLU Nonlinearity

The standard way to model a neuron’s output f as a function of its input x is with f ( x ) = t a n h ( x ) or f ( x ) = ( 1 + e - x ) - 1 . In terms of training time with gradient descent, these saturating nonlinearities are much slower than the non-saturating nonlinearity f ( x ) = m a x ( 0 , x ) . Following the work presented in [30], we refer to neurons with this non-linearity as Rectified Linear Units (ReLUs). Deep convolutional neural networks with ReLUs train several times faster than their equivalents with t a n h units. Rectified linear units, compared to the sigmoid function or similar activation functions, allow for faster and effective training of deep neural architectures on large and complex datasets.

3.3. Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) constitutes an effective way of training deep networks, similar to the ones utilized in the present manuscript. Concretely, SGD variants, like momentum Adagrad [31,32], have been used in other studies in order to achieve state-of-the-art performance. What is more, SGD optimizes the parameters Θ of a concrete network so as to minimize the loss, as mentioned in the following equation:
Θ = a r g m i n 1 N i = 1 N l ( x i , Θ )
where x 1 , , N is the training dataset. With the use of SGD, the training phase is utilized in steps, whereas at each step, a mini batch x 1 , , m of size m is considered. Regarding mini-batch, it is commonly used in order to approximate the gradient of the loss function with respect to the parameters. Therefore, the following computation is considered:
1 m l ( x i , Θ ) Θ
The advantages of using mini-batches of several examples instead of using mini-batches of one example are considered as follows. Initially, the gradient of the loss over a mini-batch is regarded an estimate of the gradient over the training set, where its quality improves as the batch size increases. In the following, the parallelism afforded by the modern computing platforms drives the much more efficient computation over a batch than the m computations for different individual examples.
On the other hand, despite the fact that stochastic gradient descent is simple and effective, it requires careful tuning of the model hyper-parameters. These parameters include the learning rate used in optimization, along with the initial values regarding the model parameters. More specifically, the training phase is overly complicated since the inputs to each layer are affected by the parameters of all preceding layers; as a result, small changes in the network parameters are amplified while the network becomes deeper.

3.4. Batch Normalization Layer

Furthermore, the nature of each layer’s parameters implies a great complication as far as the convergence of the model is concerned. A respectable amount of deviations is observed, on the values of the input layers with each passing feature set. Therefore, there emerges a significant need to normalize each feature, within a given amount of previous and subsequent feature vectors, which are denoted as batches. Thus, a multi-dimensional input, which is accepted by a layer, x = ( x ( 1 ) x ( d ) ) , has each of the d dimensions normalized in terms of mean and variance:
x ^ ( k ) = x ( k ) - E [ x ( k ) ] V a r [ x ( k ) ]
Following the application of the aforementioned procedure on the input layers, the newly-normalized values of the vectors are denoted as, x ^ 1 , , m , while the linear transformations of x ^ are denoted as y 1 , , m .
Therefore, the complete layer of the batch normalization is completed as presented:
B N γ , β : x 1 , , m y 1 , , m
The B N transform, as the batch normalizing procedure is denoted, is further elaborated with its according functions in Table 1. The constant ϵ , which is utilized in the corresponding functions, contributes to the numerical stability of the problem.
Table 1. Applying the batch normalizing layer on a given batch of inputs.

3.5. LSTM Layer

The LSTM neural network can be coherently presented as a set of sequential functions that process the data input layer, in between the current and the subsequent hidden states of the algorithm. The final product p t + 1 of the LSTM neural net whose input is denoted as x t , m t and c t the aforementioned sequence of hidden states, is produced through the following equations:
i t = σ ( W i x x t + W i m m t - 1 )
f t = σ ( W f x x t + W f m m t - 1 )
o t = σ ( W o x x t + W o m m t - 1 )
c t = f t c t - 1 + i t t a n h ( W c x x t + W c m m t - 1 )
m t = o t c t
p t + 1 = S o f t m a x ( m t )
What f t denotes is the “forget gates” of the network. The model’s “forget gates” have the capability of ignoring specific past states of certain memory cells. On the other hand, the input gates i t of the model have the power to ignore certain parts of the input in contrast to the previous cell states. The output gate is denoted by o t , whose purpose is to administer a filter to the current memory cell. This is then taken to the final hidden state.
These gates and their conglomerate structure allow this version of recurrent neural networks to map dependencies between specific inputs and the long-term behavior of the network, as well as to avoid gradients whose value is exponentially surging to inconvenient peaks.

3.6. Dropout

The dropout layer [33], which is applied sequentially following the previous stack of filters, acts as a regularizing tool of the input data when deep neural networks are concerned. Let the output product of a neural network, which is L layers deep, be denoted as y ^ , initialized with an according loss function E ( , ) with the softmax loss function as a potential loss function or the Euclidean loss function, which is also often called square loss. Then, all weights of each dimension of the neural network are initialized, through a corresponding matrix, which will contain W i , K i × K i - 1 , as well as b i , which denotes the bias, is applied with each parsing of the network, where the dimensions are K i and each layer i = 1 , , L .
The optimization of the neural network often requires another variable that acts as a regularizer of the input. This variable is frequently the L 2 norm, reduced by a certain λ decay rate, before achieving its latter goal of minimization [21,34]:
L d r o p o u t : = 1 N i = 1 N E ( y i , y i ^ ) + λ i = 1 L ( W i 2 2 + b i 2 2 )
What a dropout makes possible is the sampling of the binary variables of the network, for every input and layer accordingly. These variables are initialized with a certain value of one. The probability for their initialization is p i , where i denotes the specific current layer. If this value is not one, but is initialized with zero, then it is decided that this specific unit is dropped. This procedure and its variables are then utilized during the backward pass propagation of the model’s parameters.
The application of the dropout layer often takes place following the last convolutional layer of a given network, as well as after a recurrent type of layer, such as LSTMs or GRUs, if existent. Below, the inputs x t of the recurrent layers are presented, before the application of said dropout layer, where f t denotes the output, which the last convolutional layer of the stack has produced:
x t = f t
With the addition of the dropout technique, the following are presented:
r t i B e r n o u l l i ( p )
x t = r t f t
The variable p denotes the dropout probability. This is often set to 0 . 5 ; thus, Equation (16) notes that the dropout vectors are produced through the binary nature of the probability density function of Bernoulli. As for r t i , this denotes the i-th element of the binary r t d vectors.

3.7. Input Representation

As has been introduced in the literature of recurrent neural networks, the information contained in the sequence of words S 0 , S 1 , , S t at a certain time t + 1 is represented by a fixed length hidden state h t . The following hidden state constitutes a non-linear function of the past hidden state, as well as the current input, which produces an updated memory that can capture, through time, non-linear dependencies:
h t + 1 = f ( h t , x t + 1 )
Regarding the non-linear function f, it is of utmost importance to efficiently represent the corresponding inputs x of our model. Specifically, the non-linear function that is implemented in the present manuscript is the LSTM one. As mentioned above, the LSTM cell [11] has increasingly become popular in recent years due to the fact that it has the potential to capture long-term dependencies in sequence prediction problems. Moreover, LSTM has the ability to cope with the vanishing/exploding gradient problems in recurrent neural networks.
As it is already known, deep convolutional neural networks have achieved state-of-the-art performances regarding the field of image classification in recent years. Therefore, the use of a convolutional neural network for mapping images I to fixed length vector representations is considered in order to represent images strategically. Specifically, the architecture of GoogleNet [35], which employs an innovative batch normalization technique, is used. In the following, x = C N N ( I ) is considered as a D i × 1 vector, where D i is the fixed dimension of any input that is given to the LSTM.

4. Experiments and Results

4.1. Dataset Description

The dataset used in the experiments was the MNIST database (http://yann.lecun.com/exdb/mnist/) [15]. MNIST consists of a collection of 60,000 hand written digits in image-oriented matrix format. Each image is associated with a specific label, which signifies the expected digit prediction according to the digit intention of the writer. This labeled dataset can be easily adapted to classification algorithms; regarding the current work, the labeled images iteratively trained our proposed neural network. Additionally, our model was also trained on another dataset, CIFAR10, in order to further evaluate the contribution of our work. CIFAR10, just like MNIST, is a collection of images, correspondingly labeled with a number to represent ten different classes.

4.2. Implementation

As presented in Figure 1 and Table 2, the flow of information starts with the layer of the input image. The abbreviations for the different methods utilized in the present manuscript are introduced in Table 3. As provided by the MNIST dataset, all images of the experiments were of a rectangular shape and of 28 × 28 dimensions exactly. In the case of the CIFAR10 dataset, the initial dimensions were 32 × 32 . Each image of the corresponding batch was then proceeded in a series of convolutional layers where the first part consisted of a 32-channel decomposition of the image. The initial kernel size was 5 × 5 , and although the dimensions of the image remained the same, the number of channels constituted units that were easily adapted to the ReLU (Rectified Linear Units) activation function. The following layer consisted of a max pooling procedure where all 32 channels were directly fed to a sample-based discretization process of max pooling. There, the dimensionality of the input representation was reduced, according to the decided kernel size. Since the pooling’s kernel was 2 × 2 , our new dimensions for the channels, known as feature maps, were 14 × 14 ( 16 × 16 for CIFAR10 dataset). To conclude the initial set of layers, the batch normalization technique was performed. Concretely, the gradients of the batch were less vulnerable to outliers, and a normalized range was created within the mini-batch, thus accelerating the learning process by allowing a faster convergence for the model. Please note that the aforementioned process was repeated two and three times for the two different proposed models, respectively.
Figure 1. Network infrastructure (where F M i is the i-th feature map and P i is the i-th pooling layer).
Table 2. Abbreviations for different methods utilized.
Table 3. Abbreviations for different methods utilized.
Thus, the first step of our proposed schema consisted of this particular set of layers and could be replicated in matters of iterations and variable initialization. The second step of our process consisted of the restructuring of the LSTM units’ layer in order to accelerate the process accordingly. As elaborated in the Introduction, in our proposed schema, a number of LSTM units was interposed between the convolutional and the dense layers. A separate LSTM layer was assigned for each feature map produced by the convolutional layers. Their output was directly relevant to the selected number of output units, which was initialized as 1 in our schema, and was constructed together through a recompositioning step, arranging all output vectors as a 2-dimensional single output. Before completing the schema with the dense and logit layers, the constructed LSTM output was flattened in a single long vector, in order to adapt to the inputs of the conclusive fully-connected layer and produce the probabilities of the ten-digit classifier.

4.3. Evaluation

Regarding the results as presented in Table 4, Table 5, Table 6 and Table 7, as well as Figure 2 and Figure 3, a safe assumption can be extracted concerning the fluctuations in accuracy of the concerned methods. More specifically, the performance of the proposed schemes in terms of accuracy, loss (cross-entropy), in accordance with execution time (in seconds) has been measured. All results are comprised of the average out of five different runs in order to avoid outliers.
Table 4. Accuracy, loss and time for different numbers of epochs ( 10 % sample) for the MNIST dataset.
Table 5. Accuracy, loss and time for different numbers of epochs ( 20 % sample) for the MNIST dataset.
Table 6. Accuracy, loss and time for different numbers of epochs ( 10 % sample) for the CIFAR10dataset.
Table 7. Accuracy, loss and time for different numbers of epochs ( 20 % sample) for the CIFAR10 dataset.
Figure 2. Accuracy (percentage) for different numbers of epochs for CIFAR10 and MNIST.
Figure 3. Time (s) for different numbers of epochs for CIFAR10 and MNIST.
As expected, the baseline approach that contained a regular convolutional scheme maintained a steady and linear progression in its accuracy, as well as in the time needed for the completion of the experiments. In the event of including a batch normalization layer after the convolutional net, the total completion time of the corresponding experiments marginally increased. That chronic increment amounts to the sum of computations applied during the reduction of the covariate shift.
Making the convolutional network deeper, thus more complex in dimensionality, will affect the model’s performance. In the present work, convolutional networks of various depths were implemented in order to detect the version that could provide the better time-accuracy trade-off. Indeed, the more complex CNN managed to achieve higher percentages of accuracy. Despite the time needed, accuracy was increased in high ranks, which made the trade-off suboptimal.
The conglomerate structure of the previous baseline network, in sequence with our additional LSTM neural net logic, yielded results within our expected range of values. The precision of our novel approach scaled increasingly as the model was trained with additional epochs. In comparison with the baseline methods, similar results were achieved with a standard deviation of 2 × 10 - 3 , while also occasionally yielding surpassing accuracy scores. However, on the contrary, in terms of the experiment completion time, the results highlighted our approach as importantly more rapid in execution. While maintaining similar amounts of accuracy, our novel CNN-LSTM technique managed to reduce the execution time by 30 % during the first unstable epochs and by approximately 42 % after convergence.
More specifically, after 50 epochs of training on the MNIST dataset, our model achieved a 36 % and 44 % improvement on its results on the 2-CNN model and 3-CNN model, respectively; that is, regarding the comparison of the model that contained our contribution with the CNN-LSTM architecture against the model that achieved the highest accuracy. With the gradual increase of the data that were supplied for the training step of the models, their percentages were scaled to 42 % and 40 % , again, respectively. Regarding the experiments on the CIFAR10 database, our model provided similar results in terms of accuracy and execution time, as expected. The first results of the execution time comparison presented a 42 % improvement on both the 2-CNN, as well as 3-CNN layer models. Providing a bigger amount of data, just like the previous experiment with MNIST, yielded an improvement of 41 % and 43 % accordingly.

5. Conclusions and Future Work

In the proposed work, a methodology utilizing convolutional networks as well as batch normalization and LSTM neural net layers for achieving the best trade-off in terms of time and accuracy regarding the image classification problem, have been presented. In addition, Rectified Linear Units (ReLUs) nonlinearity is introduced for faster training of deep convolutional neural networks than their equivalents with the sigmoid function or similar activation functions on large and complex datasets. On the other hand, dropout is an effective way, aiming at regularizing deep neural networks; with this process, binary variables for every input point and for every network unit in each layer are sampled.
The baseline approach, which contained a regular convolutional scheme, maintained a steady and linear progression in its accuracy, as well as in the time needed for completion of the experiments, whereas after the import of the batch normalization layer, the overall time of the corresponding experiments marginally increased. As the convolutional network deepens, thus becoming more complex in dimensionality, the model’s performance becomes affected. The more complex the CNN, the higher the percentages of accuracy it managed to achieve, and despite the time needed, accuracy was increased to high ranks, which made the trade-off suboptimal. Furthermore, the accuracy of our novel approach scaled increasingly as the model was trained with additional epochs. While maintaining similar amounts of accuracy, our novel CNN-LSTM technique was successful at reducing the execution time by 30 % during the first unstable epochs and by approximately 42 % after convergence.
As future work, we plan to design more comprehensive CNN models as these models have achieved significant success in various computer vision tasks, including image classification, object detection, image retrieval and image captioning. It is our thought that artificial intelligence should be capable of tackling a broad set of computer vision problems. Therefore, a future plan deals with the revisiting of the vast amount of hyper parameters that such deep architectures present. The fine-tuning of the proposed model could highlight a new set of latent patterns whose appearance could not be previously detected. Furthermore, regarding the theoretical background of the work, a potential approach could be implemented concerning the complexity of the architecture. As a model deepens in terms of layers, and simultaneously in the size of its graph, there emerge new ways for defining the optimal connection within this stack of layers, and algorithmic schemes are still to be introduced.

Author Contributions

A.S., A.K., P.M. and S.S. conceived of the idea, designed and performed the experiments, analyzed the results, drafted the initial manuscript and revised the final manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Guo, Y.; Liu, Y.; Oerlemans, A.; Lao, S.; Wu, S.; Lew, M.S. Deep Learning for Visual Understanding: A Review. Neurocomputing 2016, 187, 27–48. [Google Scholar] [CrossRef]
  2. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  3. Bay, H.; Tuytelaars, T.; Gool, L.J.V. SURF: Speeded Up Robust Features. In Proceedings of the European Conference on Computer Vision (ECCV), Graz, Austria, 7–13 May 2006; pp. 404–417. [Google Scholar]
  4. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 20–25 June 2005; pp. 886–893. [Google Scholar]
  5. Sivic, J.; Zisserman, A. Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Nice, France, 13–16 October 2003; pp. 1470–1477. [Google Scholar]
  6. Simo-Serra, E.; Trulls, E.; Ferraz, L.; Kokkinos, I.; Fua, P.; Moreno-Noguer, F. Discriminative Learning of Deep Convolutional Feature Point Descriptors. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 118–126. [Google Scholar]
  7. Taigman, Y.; Yang, M.; Ranzato, M.; Wolf, L. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 1701–1708. [Google Scholar]
  8. Hinton, G.E.; Osindero, S.; Teh, Y.W. A Fast Learning Algorithm for Deep Belief Nets. Neural Comput. 2006, 18, 1527–1554. [Google Scholar] [CrossRef] [PubMed]
  9. Donahue, J.; Hendricks, L.A.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Darrell, T.; Saenko, K. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2625–2634. [Google Scholar]
  10. Kiros, R.; Salakhutdinov, R.; Zemel, R.S. Multimodal Neural Language Models. In Proceedings of the 31th International Conference on Machine Learning (ICML), Beijing, China, 21–26 June 2014; pp. 595–603. [Google Scholar]
  11. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  12. Sainath, T.N.; Vinyals, O.; Senior, A.W.; Sak, H. Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD, Australia, 19–24 April 2015; pp. 4580–4584. [Google Scholar]
  13. Ciresan, D.C.; Meier, U.; Schmidhuber, J. Multi-Column Deep Neural Networks for Image Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June2012; pp. 3642–3649. [Google Scholar]
  14. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS), Lake Tahoe Nevada, 3–8 December 2012; pp. 1106–1114. [Google Scholar]
  15. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  16. LeCun, Y.; Huang, F.J.; Bottou, L. Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Washington, DC, USA, 27 June–2 July 2004; pp. 97–104. [Google Scholar]
  17. Lee, H.; Grosse, R.B.; Ranganath, R.; Ng, A.Y. Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), Montreal, QC, Canada, 14–18 June 2009; pp. 609–616. [Google Scholar]
  18. Turaga, S.C.; Murray, J.F.; Jain, V.; Roth, F.; Helmstaedter, M.; Briggman, K.L.; Denk, W.; Seung, H.S. Convolutional Networks Can Learn to Generate Affinity Graphs for Image Segmentation. Neural Comput. 2010, 22, 511–538. [Google Scholar] [CrossRef] [PubMed]
  19. Brown, M.A.; Hua, G.; Winder, S.A.J. Discriminative Learning of Local Image Descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 43–57. [Google Scholar] [CrossRef] [PubMed]
  20. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv, 2015; arXiv:1502.03167. [Google Scholar]
  21. Gal, Y.; Ghahramani, Z. Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference. arXiv, 2015; arXiv:1506.02158. [Google Scholar]
  22. Gourgaris, P.; Kanavos, A.; Makris, C.; Perrakis, G. Review-based Entity-ranking Refinement. In Proceedings of the 11th International Conference on Web Information Systems and Technologies (WEBIST), Lisbon, Portugal, 20–22 May 2015; pp. 402–410. [Google Scholar]
  23. Kanavos, A.; Kafeza, E.; Makris, C. Can We Rank Emotions? A Brand Love Ranking System for Emotional Terms. In Proceedings of the IEEE International Congress on Big Data, New York, NY, USA, 27 June–2 July 2015; pp. 71–78. [Google Scholar]
  24. Bütepage, J.; Black, M.J.; Kragic, D.; Kjellström, H. Deep Representation Learning for Human Motion Prediction and Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1591–1599. [Google Scholar]
  25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  26. Glorot, X.; Bengio, Y. Understanding the Difficulty of Training Deep Feedforward Neural Networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
  27. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
  28. Chung, J.; Gülçehre, Ç.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv, 2014; arXiv:1412.3555. [Google Scholar]
  29. Tsironi, E.; Barros, P.V.A.; Weber, C.; Wermter, S. An Analysis of Convolutional Long Short-Term Memory Recurrent Neural Networks for Gesture Recognition. Neurocomputing 2017, 268, 76–86. [Google Scholar] [CrossRef]
  30. Nair, V.; Hinton, G.E. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
  31. Duchi, J.C.; Hazan, E.; Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159. [Google Scholar]
  32. Sutskever, I.; Martens, J.; Dahl, G.E.; Hinton, G.E. On the Importance of Initialization and Momentum in Deep Learning. In Proceedings of the International Conference on Machine Learning (ICML), Atlanta, GA, USA, 16–21 June 2013; pp. 1139–1147. [Google Scholar]
  33. Srivastava, N.; Hinton, G.E.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  34. Gal, Y.; Ghahramani, Z. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. In Proceedings of the Annual Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 1019–1027. [Google Scholar]
  35. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and Tell: A Neural Image Caption Generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3156–3164. [Google Scholar]

Article Metrics

Citations

Article Access Statistics

Multiple requests from the same IP address are counted as one view.