# An Empirical Review of Automated Machine Learning


## Abstract


## 1. Introduction

## 2. Machine Learning Paradigms

## 3. Related Work

#### 3.1. AutoML Surveys and Reviews

#### 3.2. AutoML Applications

## 4. The Experimental Path

#### 4.1. First Experimental Session

#### 4.1.1. First Experiment (Genetic Algorithms and BrainF*ck (BF) Language)

**BrainF*ck (BF) language.** To simulate the behavior of a Turing machine without having to generate code in a verbose programming language, the article proposes the BrainF*ck (BF) language, designed in 1993 by Urban Müller. A useful feature of the BF language is that a very small instruction set suffices to make the language Turing complete. BF is an esoteric language and difficult to program. However, instructions and code share a common format: ASCII coding. The input, the output, and the code are strings, and each instruction consists of a single character. To avoid also having to generate the keyboard input, the “,” character was removed from the instruction set in the experiment of the aforementioned article. Also in this experiment, the output of the program is considered to be what it prints, not what remains on the tape at the end of the program. To demonstrate the potential of this system, the article shows some experiments that generate words or sentences in English. This forced approach has limited efficacy: on the order of one million generations are needed to produce a program capable of writing a sentence. In general, applying random generation to code increases the risk of running into syntax errors. To improve the ability of the system to generate correct code, we considered several alternatives. To check the code, it is sufficient to use a type-2 grammar (see Equation (1)); to generate the code, we need a pushdown automaton (PDA).
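To make the setup concrete, the reduced instruction set can be executed by a short interpreter. The sketch below is a minimal Python implementation under the conventions stated above (no “,” input instruction; the output is what the program prints); the function name and the tape length are our own choices:

```python
def run_bf(code, tape_len=300):
    """Minimal BrainF*ck interpreter for the reduced instruction set
    (no ',' input instruction); returns what the program prints."""
    # Pre-compute matching brackets for '[' / ']' jumps.
    jumps, stack = {}, []
    for i, c in enumerate(code):
        if c == '[':
            stack.append(i)
        elif c == ']':
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    tape, ptr, pc, out = [0] * tape_len, 0, 0, []
    while pc < len(code):
        c = code[pc]
        if c == '>':
            ptr += 1
        elif c == '<':
            ptr -= 1
        elif c == '+':
            tape[ptr] = (tape[ptr] + 1) % 256
        elif c == '-':
            tape[ptr] = (tape[ptr] - 1) % 256
        elif c == '.':
            out.append(chr(tape[ptr]))
        elif c == '[' and tape[ptr] == 0:
            pc = jumps[pc]          # skip the loop body
        elif c == ']' and tape[ptr] != 0:
            pc = jumps[pc]          # jump back to the matching '['
        pc += 1
    return ''.join(out)
```

For instance, `run_bf("++++++++[>++++++++<-]>+.")` prints the single letter “A” (ASCII 65). Any random string over this small alphabet is a candidate program, which is what makes a genetic search over BF code feasible in the first place.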

#### 4.1.2. Second Experiment (Sequential Model)

- the ' character inserts the instruction preceding ' inside the input tape;
- the * character overwrites the currently pointed cell with the instruction preceding *;
- the ⌃ character removes the currently pointed cell;
- the / character writes the zero value in the currently pointed cell;
- the ; character splits the structures in the definition of a parent architecture;
- the , character splits the building blocks in the definition of a structure;
- the . character splits the code that the block can generate.
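Since the list above leaves some details open, the sketch below fixes one plausible semantics for the tape-editing characters. The insertion point and pointer behavior are our assumptions, `^` stands in for the `⌃` character, and the function name is hypothetical:

```python
def apply_meta(code, tape, ptr):
    """Hedged sketch: apply the tape-editing meta-instructions to a tape.
    Assumed semantics: X' inserts X at the pointer, X* overwrites the
    pointed cell with X, ^ removes the pointed cell, / zeroes it."""
    prev = None
    for c in code:
        if c == "'":            # insert the preceding instruction
            tape.insert(ptr, prev)
        elif c == '*':          # overwrite the pointed cell
            tape[ptr] = prev
        elif c == '^':          # remove the pointed cell
            del tape[ptr]
        elif c == '/':          # write zero in the pointed cell
            tape[ptr] = 0
        prev = c
    return tape
```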

#### 4.2. Second Experimental Session

**Neural architecture search (NAS).** This discipline proposes a connectionist approach to the problem of learning to learn. The different architectures are distinguished by the upstream learning process. With this method, the parameters of the underlying connectionist model are inferred. Early attempts model the system as a Reinforcement Learning (RL) problem [44]. The problem definition was then mathematically formalized. The search space is composed of all the combinations of possible parameters of the connectionist network to be modeled. Starting from this definition, within the Bayesian paradigm Autokeras [45] and NASBOT [46] were proposed. Evolutionary approaches contribute with Hierarchical Evo [47] and AmoebaNet [48]. We have not found any symbolist NAS applications. For the analogizer paradigm, on the other hand, we have found a strong relationship with the most successful algorithms in this discipline. The first problem we faced in the application of connectionist techniques to NAS is raised by the discrete nature of the search space. This problem, common also to unsupervised and Reinforcement Learning, is solved by techniques attributable to reasoning by analogy. The first published work [44] describing this process gave its name to the NAS field. In this experiment, two connectionist models were optimized: convolutional neural networks and recurrent neural networks. The parent network, or controller network, is implemented through a recurrent neural network. The operating principle is simple: the controller network samples architectures for the generated network. Each generated network is trained to a limited extent on the original problem, and a reward signal is returned to the controller based on the accuracy of the results. This model has been successfully applied to the image classification problem through convolutional neural networks. In the aforementioned article, results competitive with the State of the Art are shown.
However, the architecture generation is extremely inefficient due to its formulation. The experiment shown in the publication, for the classification of images, ran for several days before it could converge. Even adopting all known optimization techniques and using a dedicated graphics processing unit (GPU), this search extends over several days. The inefficiency arises because the controller architecture always has to generate the entire sequence of all connections. In RL systems, however, it is possible to choose how to model the actions of the controller. In Efficient Neural Architecture Search (ENAS) [49], all generated architectures are modeled as a composition of blocks (subnets) that share weights. In Progressive Neural Architecture Search [50], the recurrent neural network that acts as a controller sequentially generates network transformation operations instead of returning, at each step, the entire sequence representing the generated architecture. Many other research works propose NAS algorithms, each adapted to a specific research problem. In [51], the concept of search space, common to all these methods, is unified. The search space consists of the set of parameters, variables, and structures needed to define a connectionist neural architecture, that is, the set of possible values assumed by the hyperparameters of the network. Such values can be continuous or discrete numeric variables, or they can be categorical values, that is, ordinal or nominal variables.
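As a minimal illustration of a search space and of the simplest possible search strategy, the sketch below performs random search over a hypothetical space mixing discrete numeric, categorical, and continuous hyperparameters. The space, the names, and the `evaluate` callback are illustrative assumptions, not the algorithms cited above:

```python
import random

# Hypothetical search space: each hyperparameter with its candidate values.
SEARCH_SPACE = {
    "num_layers": [1, 2, 3, 4],          # discrete numeric
    "hidden_units": [32, 64, 128, 256],  # discrete numeric
    "activation": ["relu", "tanh"],      # categorical (nominal)
    "learning_rate": (1e-4, 1e-1),       # continuous range
}

def sample_architecture(space, rng=random):
    """Sample one point of the search space."""
    arch = {}
    for name, choices in space.items():
        if isinstance(choices, tuple):   # continuous: sample in the range
            lo, hi = choices
            arch[name] = rng.uniform(lo, hi)
        else:                            # discrete/categorical: pick one
            arch[name] = rng.choice(choices)
    return arch

def random_search(space, evaluate, budget=20, rng=random):
    """Keep the best of `budget` randomly sampled architectures."""
    best, best_score = None, float("-inf")
    for _ in range(budget):
        arch = sample_architecture(space, rng)
        score = evaluate(arch)
        if score > best_score:
            best, best_score = arch, score
    return best, best_score
```

In practice, `evaluate` would train the sampled architecture to a limited extent and return a validation score, which plays the role of the reward signal returned to the controller in [44].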

**Attention mechanism.** One of the broadest fields of application of the analogizer paradigm is automatic text translation. This domain has seen a rapid rise of Machine Learning, passing through different models up to the most recent architectures. A problem that concerns not only automatic text translation is having to select, within the input, the features the system needs to compute the output. The attention mechanism was developed precisely to overcome this problem. In this way, it is possible to model a many-to-many relationship between the words of the two translated sentences. Generally speaking, the integration of the attention mechanism enables the system to focus on a portion of the input instead of considering it in its entirety. In the specific case of automatic text translation, the whole input is the sentence to be translated: through the attention mechanism, it is possible to translate sets of words into other sets of words. Furthermore, the system can preserve the specific positioning of the words in the sentence. This system is also used intensively in the Computer Vision domain to isolate the components identified in images. The operating principle of the attention mechanism takes its cue from the master algorithm of the analogizer paradigm. To simplify the concept, the attention mechanism is nothing more than a property of the output defined on the input. In automatic text translation, it is like asking which words influenced the translation of a specific word. In image classification, it is like asking which part of the image motivated the choice of the class. Basically, it is a mechanism that also occurs in human beings when reasoning by analogy. Specifically, if we can classify an object based on a localized physical characteristic, recognizing the same object a second time will be much faster, since we can turn our attention to the detail that we know how to recognize.
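One common concrete formulation of this idea, scaled dot-product attention, can be sketched in a few lines of NumPy. The weight matrix makes explicit “which words influenced” each output position; this is a general illustration, not the specific mechanism of any model discussed here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each output row is a weighted
    average of the rows of V; the weights say how much each input
    position influenced each output position."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # query/key similarity
    weights = softmax(scores, axis=-1) # focus distribution over the input
    return weights @ V, weights
```

Each row of `weights` sums to 1, so every output is a convex combination of the input values: the system attends to a portion of the input rather than to all of it uniformly.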
Among the most common and exhaustively studied applications, which require an elementary inference mechanism similar to that necessary for the search for neural architectures, we find automatic text translation, game solving, and many other complex tasks. Therefore, in the subsequent experimental session, we analyzed RL architectures and some models that jointly exploit the connectionist, Bayesian, and analogizer paradigms.

**Neural Arithmetic Logic Units (NALU).** Linear activation functions are known to produce linear output. It is also known that, to overcome this limitation, it is possible to adopt non-linear functions with properties that make differentiation simple. However, it is less known that these functions introduce another limitation: neural networks tend not to generalize to values outside the range of the training examples. If an artificial neural network is trained with data contained within a range to approximate a certain algorithm, such as the sum of its inputs, it will not be able to generalize this sum outside that range. Figure 4 shows the validation error for an autoencoding task with various activation functions.
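A minimal sketch of the Neural Accumulator (NAC) cell at the core of NALU [42] shows why it can extrapolate: the effective weight matrix is biased toward values in {−1, 0, 1}, so a learned addition remains exact for inputs of any magnitude. The weight values below are illustrative, not learned:

```python
import numpy as np

def nac(x, W_hat, M_hat):
    """Neural Accumulator (NAC) cell: the effective weights
    W = tanh(W_hat) * sigmoid(M_hat) saturate toward {-1, 0, 1},
    encoding exact additions/subtractions of the inputs."""
    W = np.tanh(W_hat) * (1.0 / (1.0 + np.exp(-M_hat)))
    return W @ x

# With strongly saturated parameters, the cell sums its inputs
# regardless of their magnitude (no training-range restriction).
W_hat = np.full((1, 2), 10.0)
M_hat = np.full((1, 2), 10.0)
y = nac(np.array([100.0, 250.0]), W_hat, M_hat)  # close to 350
```

An ordinary nonlinearity such as tanh, trained to sum inputs in [−5, 5], would saturate and fail on an input like 250; here the arithmetic is carried by the near-identity weights, not by the activation.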

**Differentiable Neural Computer (DNC).** In the scheme of the long short-term memory (LSTM) [53], we replaced the activation functions with NALU cells. We did not obtain significant results, so we tried to integrate this model into a DNC. A Neural Turing Machine (NTM) consists of a controller network, read heads, write heads, and a memory. The memory is accessed via an addressing system; the key encodes both the information to be searched for and the search method. There are two search methods: content-based and address-based. The inputs, together with the previous reading, are fed to the controller network (often an LSTM network), which returns the output. The output is used as a reading and writing key for the heads. This system is a kind of generalization of an LSTM where the state is an entire $M\times N$ matrix that is accessed by content or by address. In an LSTM, the state is a parameter that affects the operation of the neural network at the current step (${h}_{t}$). In an NTM, by analogy, the state represents the sequence of instructions to be computed in a given condition. In an NTM, the number of states is limited by the memory size. The next model, the Differentiable Neural Computer (DNC) (Figure 8), addresses this problem by allowing memory to be overwritten. In these models, the attention mechanism is exploited to distribute read and write accesses to memory. This way, the problem of having to access memory locations with discrete addresses does not exist: we write and read every cell at the same time, though to different extents. This application of the attention mechanism shows how it is possible to transform a problem of a discrete and non-derivable nature into one of a continuous nature. This model is able to solve algorithmic problems of a discrete nature, such as graph exploration, and to model these structures in its own memory. Therefore, we are interested in applying DNCs to more complex structures.
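The content-based addressing described above can be sketched as follows: attention weights over the memory rows are obtained from the cosine similarity with the key, so every cell is read at the same time, to a different extent. This is a simplified illustration of the NTM/DNC read operation [43]; the sharpness parameter `beta` and the shapes are our own choices:

```python
import numpy as np

def content_read(memory, key, beta=1.0):
    """Content-based read: softmax over cosine similarities between the
    key and each memory row; the read vector is a weighted average, so
    the whole operation stays continuous and differentiable."""
    sims = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    )
    w = np.exp(beta * sims)       # beta sharpens or flattens the focus
    w /= w.sum()
    return w @ memory, w
```

With a large `beta`, the read concentrates on the best-matching row, approximating a discrete lookup while remaining differentiable end to end.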

#### 4.3. Third Experimental Session

**Attention, routing, and capsules.** In the third experimental session, we identified three ML mechanisms, commonly used in the aforementioned applications of automatic text translation and Reinforcement Learning, which are at least in part the result of reasoning by analogy. The attention mechanism suggests an “algorithmic path” to the network on which it is applied: it hides unnecessary information and highlights useful information. By influencing the inference in the underlying network, this method could also be considered a routing algorithm. Routing in neural networks generally operates on the computation; in particular, it can help decide whether or not to pass through a sub-network of the network. Under the forced assumption that attention is a routing algorithm, attention applies to inputs, not edges.
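For concreteness, the routing-by-agreement procedure of [66] can be sketched as follows: the coupling coefficients decide how much each lower-level capsule's prediction contributes to each upper-level capsule, and agreement (a dot product) reinforces the coupling. The shapes and the iteration count below are illustrative:

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squash nonlinearity: keeps the direction, maps length into [0, 1)."""
    n2 = (s * s).sum(axis=-1, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, iters=3):
    """Routing-by-agreement over predictions u_hat with shape
    (n_lower, n_upper, dim). Returns the upper capsule outputs and the
    coupling coefficients c (a softmax, so each lower capsule's vote
    is distributed across the upper capsules)."""
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))             # routing logits
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = (c[..., None] * u_hat).sum(axis=0)   # weighted votes
        v = squash(s)                            # upper capsule outputs
        b += (u_hat * v).sum(axis=-1)            # agreement update
    return v, c
```

Read this way, the coefficients `c` are the routing decision: like attention weights, they select which parts of the lower layer influence each upper capsule, but they are computed iteratively from agreement rather than from a query/key product.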

#### 4.3.1. First Experiment

#### 4.3.2. Second Experiment

#### 4.3.3. Third Experiment

## 5. Empirical Evaluation

The **first column (Game)** reports the game on which the model was tested.

The **second column (Algorithm)** denotes the algorithm on which the tested model is based. In particular, we tested the Actor-Critic (AC) [64] algorithm and the Deep-Q Network (DQN) [65] algorithm. We chose these two algorithms because they are simple, yet efficient. We used the code written by the authors of the papers, where provided, or we implemented the algorithms from scratch. The comparative analysis was performed using the Facebook PyTorch library (https://pytorch.org/ (Accessed: 6 January 2021)).

The **third column (Vision)** indicates whether the architecture involves a computer vision module or whether the input of the model is directly a compressed representation of the environment state. In particular, we integrated a computer vision system based on convolutional networks into the original AC architecture. It should be noted that we did not implement a visionless DQN model and that, generally, the introduction of a vision module increased the number of episodes needed to solve the game. For models involving image reconstruction, we adapted the optimization model suggested in [66]. During the training process, we mask out all but the activity vector of the last chosen action capsule. Then, we use this activity vector to reconstruct the input image. The output of the action capsule is fed into a decoder. We minimize the sum of squared differences between the outputs of the logistic units and the pixel intensities, and we scale down this reconstruction loss so that it does not dominate the policy loss during training. We experimented with different approaches for defining the vision system. In the first models, we use the current game frame directly as input to the system. Once an action has been chosen, the decoder then has to reconstruct the next frame, foreseeing the resulting transition. Subsequent models use the difference between the two previous frames as input. Finally, the most complex model uses a weighted average of the light intensity of all previous frames, and the decoder, similarly, generates the path as a weighted average of the future predictions.

The **fourth column (Capsule Routing)** shows the role of capsule routing within the architecture. Specifically, vision means that the capsule network was responsible for inferring a compressed representation of the state of the game starting from a frame. By policy, instead, we mean that routing has the role of directly choosing the actions to be performed. Specifically, we replaced the two main components of the architecture with capsule networks. Capsule routing can replace both the neural network that represents the actor and the critic, and the convolutional network that extracts a compressed representation of the state starting from a frame or a transition. Other variants differ in how the loss function is defined or the state is modeled, or in the introduction of an additional component responsible for reconstructing the transitions, the image, or the state. For instance, the first AC models that substitute capsule networks for the policy map a hidden layer of capsules to the state vector and the last capsule layer to the actor and critic. We introduced as an additional loss term the squared difference between the vector of the activated capsule in the layer associated with the state and the state actually reached. The goal of AC with corrections is, instead, to decouple the capsule network structure from the state structure by introducing the decoder and applying it to the activated capsule of the final layer. The activated capsule of the final layer represents both an encoding of the state reached in the next step and the probability of choosing each action. In AC with memory, the state consists of the concatenation of the current and past observations. In AC without policy, we deviated further from the definition of the AC architecture. In fact, this model does not properly have an actor: it provides a description of the state reached for each action, and the choice is made by evaluating the reachable states on a combinatorial basis.

The **fifth column (Optimization Algorithm)** specifies the optimization algorithm used to identify, through a series of iterations, the weight values for which the cost function is minimal. Among the possible optimizers, we experimented with Adam [67], Rectified Adam (RAdam) [68], and RMSProp [69]. The choice of these optimizers was suggested by the State of the Art. In particular, where mentioned, we used the same optimizer adopted by the authors of the paper. DQN policy training is often optimized using Root Mean Square Propagation (RMSProp). This optimization algorithm, proposed by Geoffrey Hinton, chooses a different learning rate for each parameter. Moreover, the learning rates are automatically adjusted to dampen the oscillations along gradient descent paths that present a pathological curvature, such as a ravine in the loss surface. Another popular technique used along with stochastic gradient descent is momentum. Instead of using only the gradient of the current step to guide the search, momentum also accumulates the gradients of past steps to determine the direction to go. Adam, or Adaptive Moment Estimation, combines the heuristics of both momentum and RMSProp. Rectified Adam (RAdam) is a variant of the Adam stochastic optimizer that introduces a term to rectify the variance of the adaptive learning rate, seeking to tackle the bad convergence problem suffered by Adam.

The **sixth column (Learning Rate)** shows the learning rate, namely, a tuning parameter of the optimization algorithm that determines the step size taken at each iteration in approaching a minimum of the loss function. Also in this case, we kept the original setting, where provided by the authors.

The **seventh column (Number of Episodes)** reports the maximum number of episodes on which the algorithm was tested.

The **eighth column (Solved)** refers to whether the game was solved or not. For example, for the CartPole-v0 game, the value “Yes” indicates that the model was able to complete ten episodes consecutively. To complete an episode, it is necessary to keep the pole balanced for 200 consecutive frames. If this value is positive, the maximum number of episodes for this experiment is the number of episodes needed to converge.
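The update rules of RMSProp and Adam described for the fifth column can be sketched as follows (scalar versions with the usual default hyperparameters; this is an illustration of the heuristics, not the PyTorch implementations we actually used):

```python
import numpy as np

def rmsprop_step(w, g, state, lr=1e-3, rho=0.9, eps=1e-8):
    """RMSProp: per-parameter learning rate scaled by a running average
    of squared gradients, damping oscillations along steep directions."""
    state["v"] = rho * state["v"] + (1 - rho) * g * g
    return w - lr * g / (np.sqrt(state["v"]) + eps)

def adam_step(w, g, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: combines momentum (first moment m) with RMSProp-style
    scaling (second moment v), with bias correction for early steps."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g        # momentum term
    state["v"] = b2 * state["v"] + (1 - b2) * g * g    # RMSProp term
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)
```

RAdam [68] modifies the Adam step by rectifying the variance of the adaptive term during the early iterations, when few gradient samples are available.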

## 6. Conclusions and Future Works

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

- Caldarelli, S.; Feltoni Gurini, D.; Micarelli, A.; Sansonetti, G. A Signal-Based Approach to News Recommendation. In CEUR Workshop Proceedings; CEUR-WS.org: Aachen, Germany, 2016; Volume 1618. [Google Scholar]
- Biancalana, C.; Gasparetti, F.; Micarelli, A.; Miola, A.; Sansonetti, G. Context-aware Movie Recommendation Based on Signal Processing and Machine Learning. In Proceedings of the 2nd Challenge on Context-Aware Movie Recommendation, CAMRa ’11, Chicago, IL, USA, 27 October 2011; ACM: New York, NY, USA, 2011; pp. 5–10. [Google Scholar]
- Onori, M.; Micarelli, A.; Sansonetti, G. A Comparative Analysis of Personality-Based Music Recommender Systems. In CEUR Workshop Proceedings; CEUR-WS.org: Aachen, Germany, 2016; Volume 1680, pp. 55–59. [Google Scholar]
- Sansonetti, G.; Gasparetti, F.; Micarelli, A.; Cena, F.; Gena, C. Enhancing Cultural Recommendations through Social and Linked Open Data. User Model. User-Adapt. Interact. **2019**, 29, 121–159. [Google Scholar]
- Sansonetti, G. Point of Interest Recommendation Based on Social and Linked Open Data. Pers. Ubiquitous Comput. **2019**, 23, 199–214. [Google Scholar]
- Fogli, A.; Sansonetti, G. Exploiting Semantics for Context-Aware Itinerary Recommendation. Pers. Ubiquitous Comput. **2019**, 23, 215–231. [Google Scholar]
- Feltoni Gurini, D.; Gasparetti, F.; Micarelli, A.; Sansonetti, G. Temporal People-to-people Recommendation on Social Networks with Sentiment-based Matrix Factorization. Future Gener. Comput. Syst. **2018**, 78, 430–439. [Google Scholar]
- Wolpert, D.H.; Macready, W.G. No free lunch theorems for optimization. IEEE Trans. Evol. Comput. **1997**, 1, 67–82. [Google Scholar]
- Yao, Q.; Wang, M.; Escalante, H.J.; Guyon, I.; Hu, Y.; Li, Y.; Tu, W.; Yang, Q.; Yu, Y. Taking Human out of Learning Applications: A Survey on Automated Machine Learning. arXiv **2018**, arXiv:1810.13306. [Google Scholar]
- Waring, J.; Lindvall, C.; Umeton, R. Automated machine learning: Review of the state-of-the-art and opportunities for healthcare. Artif. Intell. Med. **2020**, 104, 101822. [Google Scholar] [PubMed]
- Hilbert, D. Die grundlagen der mathematik. In Die Grundlagen der Mathematik; Springer: Berlin/Heidelberg, Germany, 1928; pp. 1–21. [Google Scholar]
- Church, A. An Unsolvable Problem of Elementary Number Theory. Am. J. Math. **1936**, 58, 345–363. [Google Scholar]
- Turing, A.M. On computable numbers, with an application to the Entscheidungsproblem. Proc. Lond. Math. Soc. **1937**, 2, 230–265. [Google Scholar]
- Heekeren, H.R.; Marrett, S.; Ungerleider, L.G. The neural systems that mediate human perceptual decision making. Nat. Rev. Neurosci. **2008**, 9, 467–479. [Google Scholar]
- Vaccaro, L.; Sansonetti, G.; Micarelli, A. Automated Machine Learning: Prospects and Challenges. In Proceedings of the Computational Science and Its Applications—ICCSA 2020, Cagliari, Italy, 1–4 July 2020; Springer International Publishing: Cham, Switzerland, 2020; Volume 12252 LNCS, pp. 119–134. [Google Scholar]
- Domingos, P. The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World; Basic Books: New York, NY, USA, 2015. [Google Scholar]
- Elsken, T.; Metzen, J.H.; Hutter, F. Neural Architecture Search: A Survey. J. Mach. Learn. Res. **2019**, 20, 55:1–55:21. [Google Scholar]
- Fox, G.C.; Glazier, J.A.; Kadupitiya, J.C.S.; Jadhao, V.; Kim, M.; Qiu, J.; Sluka, J.P.; Somogyi, E.T.; Marathe, M.; Adiga, A.; et al. Learning Everywhere: Pervasive Machine Learning for Effective High-Performance Computation. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Rio de Janeiro, Brazil, 20–24 May 2019; pp. 422–429. [Google Scholar]
- Meier, B.B.; Elezi, I.; Amirian, M.; Dürr, O.; Stadelmann, T. Learning Neural Models for End-to-End Clustering. In Artificial Neural Networks in Pattern Recognition; Springer International Publishing: Cham, Switzerland, 2018; pp. 126–138. [Google Scholar]
- Hutter, F.; Kotthoff, L.; Vanschoren, J. (Eds.) Automated Machine Learning—Methods, Systems, Challenges; The Springer Series on Challenges in Machine Learning; Springer: Berlin/Heidelberg, Germany, 2019. [Google Scholar]
- Zöller, M.A.; Huber, M.F. Benchmark and Survey of Automated Machine Learning Frameworks. arXiv **2019**, arXiv:1904.12054. [Google Scholar]
- Escalante, H.J. Automated Machine Learning—A brief review at the end of the early years. arXiv **2020**, arXiv:2008.08516. [Google Scholar]
- Liu, Z.; Xu, Z.; Madadi, M.; Junior, J.J.; Escalera, S.; Rajaa, S.; Guyon, I. Overview and unifying conceptualization of automated machine learning. In Proceedings of the Automating Data Science Workshop, Wurzburg, Germany, 20 September 2019. [Google Scholar]
- He, X.; Zhao, K.; Chu, X. AutoML: A survey of the state-of-the-art. Knowl.-Based Syst. **2021**, 212. [Google Scholar] [CrossRef]
- Vanschoren, J. Meta-Learning. In Automated Machine Learning: Methods, Systems, Challenges; Springer International Publishing: Cham, Switzerland, 2019; pp. 35–61. [Google Scholar]
- Feurer, M.; Hutter, F. Hyperparameter Optimization. In Automated Machine Learning: Methods, Systems, Challenges; Springer International Publishing: Cham, Switzerland, 2019; pp. 3–33. [Google Scholar]
- Shawi, R.E.; Maher, M.; Sakr, S. Automated Machine Learning: State-of-The-Art and Open Challenges. arXiv **2019**, arXiv:1906.02287. [Google Scholar]
- Thornton, C.; Hutter, F.; Hoos, H.H.; Leyton-Brown, K. Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, Chicago, IL, USA, 11–14 August 2013; Association for Computing Machinery: New York, NY, USA, 2013; pp. 847–855. [Google Scholar]
- Ren, P.; Xiao, Y.; Chang, X.; Huang, P.; Li, Z.; Chen, X.; Wang, X. A Comprehensive Survey of Neural Architecture Search: Challenges and Solutions. arXiv **2020**, arXiv:2006.02903. [Google Scholar]
- Wistuba, M.; Rawat, A.; Pedapati, T. A Survey on Neural Architecture Search. arXiv **2019**, arXiv:1905.01392. [Google Scholar]
- Chen, Y.; Song, Q.; Hu, X. Techniques for Automated Machine Learning. arXiv **2019**, arXiv:1907.08908. [Google Scholar]
- Frazier, P.I. Bayesian Optimization. In Recent Advances in Optimization and Modeling of Contemporary Problems; PubsOnLine: Catonsville, MD, USA, 2018; Chapter 11; pp. 255–278. [Google Scholar]
- Zöller, M.; Huber, M.F. Survey on Automated Machine Learning. arXiv **2019**, arXiv:1904.12054. [Google Scholar]
- Tuggener, L.; Amirian, M.; Rombach, K.; Lorwald, S.; Varlet, A.; Westermann, C.; Stadelmann, T. Automated Machine Learning in Practice: State of the Art and Recent Results. In Proceedings of the 6th Swiss Conference on Data Science (SDS), Bern, Switzerland, 14 June 2019; IEEE: Piscataway, NJ, USA, 2019. [Google Scholar]
- Chung, C.; Chen, C.; Shih, W.; Lin, T.; Yeh, R.; Wang, I. Automated machine learning for Internet of Things. In Proceedings of the 2017 IEEE International Conference on Consumer Electronics (ICCE-TW), Taipei, Taiwan, 12–14 June 2017; pp. 295–296. [Google Scholar]
- Li, Z.; Guo, H.; Wang, W.M.; Guan, Y.; Barenji, A.V.; Huang, G.Q.; McFall, K.S.; Chen, X. A Blockchain and AutoML Approach for Open and Automated Customer Service. IEEE Trans. Ind. Inform. **2019**, 15, 3642–3651. [Google Scholar]
- Di Mauro, M.; Galatro, G.; Liotta, A. Experimental Review of Neural-Based Approaches for Network Intrusion Management. IEEE Trans. Netw. Serv. Manag. **2020**, 17, 2480–2495. [Google Scholar] [CrossRef]
- Maipradit, R.; Hata, H.; Matsumoto, K. Sentiment Classification Using N-Gram Inverse Document Frequency and Automated Machine Learning. IEEE Softw. **2019**, 36, 65–70. [Google Scholar] [CrossRef][Green Version]
- Shi, X.; Wong, Y.; Chai, C.; Li, M. An Automated Machine Learning (AutoML) Method of Risk Prediction for Decision-Making of Autonomous Vehicles. IEEE Trans. Intell. Transp. Syst. **2020**, 1–10. [Google Scholar] [CrossRef]
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794. [Google Scholar]
- Schmidhuber, J. Optimal ordered problem solver. Mach. Learn. **2004**, 54, 211–254. [Google Scholar] [CrossRef][Green Version]
- Trask, A.; Hill, F.; Reed, S.; Rae, J.; Dyer, C.; Blunsom, P. Neural Arithmetic Logic Units. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS), NIPS’18, Montreal, QC, Canada, 3–8 December 2018; Curran Associates Inc.: Red Hook, NY, USA, 2018. [Google Scholar]
- Graves, A.; Wayne, G.; Reynolds, M.; Harley, T.; Danihelka, I.; Grabska-Barwińska, A.; Colmenarejo, S.G.; Grefenstette, E.; Ramalho, T.; Agapiou, J.; et al. Hybrid computing using a neural network with dynamic external memory. Nature **2016**, 538, 471–476. [Google Scholar] [CrossRef]
- Zoph, B.; Le, Q.V. Neural Architecture Search with Reinforcement Learning. arXiv **2016**, arXiv:1611.01578. [Google Scholar]
- Jin, H.; Song, Q.; Hu, X. Auto-keras: An efficient neural architecture search system. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 1946–1956. [Google Scholar]
- Kandasamy, K.; Neiswanger, W.; Schneider, J.; Poczos, B.; Xing, E.P. Neural Architecture Search with Bayesian Optimisation and Optimal Transport. In Advances in Neural Information Processing Systems (NIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2018; pp. 2016–2025. [Google Scholar]
- Liu, H.; Simonyan, K.; Vinyals, O.; Fernando, C.; Kavukcuoglu, K. Hierarchical Representations for Efficient Architecture Search. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Real, E.; Aggarwal, A.; Huang, Y.; Le, Q.V. Regularized Evolution for Image Classifier Architecture Search. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI), Honolulu, HI, USA, 27 January–1 February 2019; pp. 4780–4789. [Google Scholar]
- Pham, H.; Guan, M.Y.; Zoph, B.; Le, Q.V.; Dean, J. Efficient neural architecture search via parameter sharing. arXiv **2018**, arXiv:1802.03268. [Google Scholar]
- Liu, C.; Zoph, B.; Neumann, M.; Shlens, J.; Hua, W.; Li, L.J.; Fei-Fei, L.; Yuille, A.; Huang, J.; Murphy, K. Progressive Neural Architecture Search. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer International Publishing: Cham, Switzerland, 2018; pp. 19–35. [Google Scholar]
- Jastrzębski, S.; de Laroussilhe, Q.; Tan, M.; Ma, X.; Houlsby, N.; Gesmundo, A. Neural Architecture Search Over a Graph Search Space. arXiv **2018**, arXiv:1812.10666. [Google Scholar]
- Chen, T.Q.; Rubanova, Y.; Bettencourt, J.; Duvenaud, D.K. Neural ordinary differential equations. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 3–8 December 2018; pp. 6571–6583. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. **1997**, 9, 1735–1780. [Google Scholar] [CrossRef]
- Hinton, G.E.; Krizhevsky, A.; Wang, S.D. Transforming auto-encoders. In Proceedings of the International Conference on Artificial Neural Networks, Espoo, Finland, 14–17 June 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 44–51. [Google Scholar]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680. [Google Scholar]
- McGill, M.; Perona, P. Deciding how to decide: Dynamic routing in artificial neural networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 2363–2372. [Google Scholar]
- Hassan, H.A.M.; Sansonetti, G.; Gasparetti, F.; Micarelli, A. Semantic-based Tag Recommendation in Scientific Bookmarking Systems. In Proceedings of the ACM RecSys 2018, Vancouver, BC, Canada, 2–7 October 2018; ACM: New York, NY, USA, 2018; pp. 465–469. [Google Scholar]
- Hahn, T.; Pyeon, M.; Kim, G. Self-Routing Capsule Networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 7656–7665. [Google Scholar]
- Choi, J.; Seo, H.; Im, S.; Kang, M. Attention routing between capsules. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), Seoul, Korea, 27–28 October 2019; pp. 1981–1989. [Google Scholar]
- Hinton, G.E.; Sabour, S.; Frosst, N. Matrix capsules with EM routing. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Bellemare, M.G.; Naddaf, Y.; Veness, J.; Bowling, M. The Arcade Learning Environment: An Evaluation Platform for General Agents. J. Artif. Intell. Res. **2013**, 47, 253–279. [Google Scholar] [CrossRef]
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M.A. Playing Atari with Deep Reinforcement Learning. arXiv **2013**, arXiv:1312.5602. [Google Scholar]
- Shao, K.; Tang, Z.; Zhu, Y.; Li, N.; Zhao, D. A Survey of Deep Reinforcement Learning in Video Games. arXiv **2019**, arXiv:1912.10944. [Google Scholar]
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; A Bradford Book: Cambridge, MA, USA, 2018. [Google Scholar]
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature **2015**, 518, 529–533. [Google Scholar] [CrossRef]
- Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic Routing between Capsules. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), NIPS’17, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 3859–3869. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Liu, L.; Jiang, H.; He, P.; Chen, W.; Liu, X.; Gao, J.; Han, J. On the Variance of the Adaptive Learning Rate and Beyond. arXiv **2020**, arXiv:1908.03265. [Google Scholar]
- Tieleman, T.; Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw. Mach. Learn. **2012**, 4, 26–31. [Google Scholar]
- Richardson, M.; Domingos, P. Markov Logic Networks. Mach. Learn. **2006**, 62, 107–136. [Google Scholar] [CrossRef][Green Version]
- Navon, A.; Achituve, I.; Maron, H.; Chechik, G.; Fetaya, E. Auxiliary Learning by Implicit Differentiation. arXiv **2020**, arXiv:2007.02693. [Google Scholar]

**Figure 1.** The five paradigms of Machine Learning according to the classification proposed by Pedro Domingos [16].

**Figure 4.** The generalization problem outside the range of the training values [42]. The authors train an autoencoder to take a scalar value as input, encode the value within its hidden layers, and reconstruct the input value as a linear combination of the last hidden layer. Each autoencoder is identical in its parameterization, tuning, and initialization, differing only in the choice of nonlinearity on hidden layers. For each point in the figure, the authors train 100 models to encode values between −5 and 5 and average their ability to encode values between −20 and 20.

**Figure 5.** Neural Arithmetic Logic Units with Bias (NALUB) architecture. The matrices of the parameters $\widehat{\mathbf{W}}$, $\widehat{\mathbf{M}}$, and $\mathbf{g}$ of the NALU model proposed in [42] are represented in the lower part of the figure. The matrix $tanh\left(\widehat{\mathbf{W}}\right)$ (where tanh denotes the hyperbolic tangent function) has all its values within the interval $[-1,1]$, whilst the matrix $\sigma \left(\widehat{\mathbf{M}}\right)$ (where $\sigma $ denotes the sigmoid function) is composed only of values within the interval $[0,1]$. Therefore, their elementwise product yields matrices with elements within the interval $[-1,1]$ and polarized towards the stable points $\{-1,0,1\}$. This allows the NALU architecture to perform some arithmetic operations on the input vector $\mathbf{x}$. The introduction of a bias vector (highlighted in red in the figure) further extends the set of possible operations. For instance, specific configurations of $\mathbf{W}$ and $\mathbf{g}$ enable the NALU model to obtain a component of the output vector $\mathbf{y}$ as a product of two components of the input vector $\mathbf{x}$. The bias vector, hence, allows us to multiply, divide, add, and subtract these components with constants. The bias is added after carrying out the matrix product between the matrix $\mathbf{W}$ and the vector $\mathbf{x}$.
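The weight construction described in the caption can be sketched in a few lines of NumPy. All sizes and parameter values below are toy choices for illustration, not the parameterization used in the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical unconstrained parameter matrices (toy sizes).
rng = np.random.default_rng(0)
W_hat = rng.normal(size=(2, 3))   # corresponds to W-hat in the caption
M_hat = rng.normal(size=(2, 3))   # corresponds to M-hat in the caption
b = np.zeros(2)                   # the added bias vector of the NALUB variant

# Effective weight matrix: elementwise product of tanh(W_hat), which lies
# in [-1, 1], and sigmoid(M_hat), which lies in [0, 1]. The product is
# polarized towards the stable points {-1, 0, 1}.
W = np.tanh(W_hat) * sigmoid(M_hat)

x = np.array([2.0, 3.0, 5.0])
y = W @ x + b   # the bias is added after the matrix product, as in the caption
```

Because every element of `W` is pushed towards −1, 0, or 1, each output component tends to a signed sum of selected input components, which is what makes the additive arithmetic operations learnable.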

**Figure 7.** Neural Arithmetic Logic Units with Bias Tetration controlled Cell (NALUBTC) architecture. In the NALU architecture [42], the addition and subtraction operations are obtained by directly multiplying the input vector $\mathbf{x}$ by the transformation matrix $\mathbf{W}$, which yields the output vector $\mathbf{y}$. To perform a multiplication or a division, the vector $\mathbf{x}$ is first transformed by applying the $log\left(x\right)$ operation to each component. This way, the sum of the logarithms corresponds to the product of the arguments, as follows: ${x}_{1}\ast {x}_{2}={e}^{(ln\left({x}_{1}\right)+ln\left({x}_{2}\right))}$. Therefore, $\mathbf{y}$ is obtained by exponentiating the matrix product $\mathbf{W}\cdot \mathbf{x}$. In NALUBTC, we introduce the scalar parameter g (with $g\in \mathbb{R}$), which we use as the height of the tetration function (see Equation (2)). For the value $g=0$, the possible operations are addition and subtraction; for $g=1$, multiplication and division; for $g=2$, exponentiation and root. Hence, the set of possible operations is extended with respect to the NALU architecture, allowing for a more general automated parameter optimization.
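The correspondence between the additive path ($g=0$) and the multiplicative log-space path ($g=1$) can be checked numerically. The weight matrix and inputs below are hypothetical toy values:

```python
import numpy as np

x = np.array([2.0, 5.0])
W = np.array([[1.0, 1.0]])    # toy weight row selecting both input components

# g = 0: additive path, y = W @ x, giving addition/subtraction.
y_add = W @ x                 # 2 + 5 = 7

# g = 1: multiplicative path, y = exp(W @ log(x)); the sum of logarithms
# corresponds to the product: x1 * x2 = e^(ln x1 + ln x2).
y_mul = np.exp(W @ np.log(x)) # 2 * 5 = 10
```

Raising the tetration height once more ($g=2$) applies the same trick one level up, turning weighted sums in doubly-logarithmic space into exponentiation and roots.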

**Figure 8.** Differentiable Neural Computer (DNC) architecture. The memory is a matrix accessed by content or by address through a reading key. The controller generates the read and write keys. The memory is accessed by the read and write heads, which apply different mechanisms such as attention and gating to determine the extent of the effect of the read/write action on the memory locations. The architecture learns to use memory to represent solutions to the problem. It learns to read and write the data needed to solve the problem. It is able to model, visit, and explore even complex structures, such as networks and graphs, saving the relationships between different inputs in memory. The controller is shown on the left of the figure, along with the input and output of the DNC architecture. The read and write heads are shown to the right of the controller. In the center, the memory is represented; the colored lines are the locations accessed. Finally, on the right, the links are shown first, followed by the usage vector, which indicates the locations used recently.
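Content-based addressing, the access mechanism mentioned in the caption, can be sketched as a softmax over cosine similarities between the read key and each memory row. The sizes, the `content_weighting` helper, and the sharpening parameter `beta` are illustrative assumptions, not the exact DNC formulation:

```python
import numpy as np

def content_weighting(memory, key, beta):
    """Softmax over cosine similarity between the read key and each
    memory row, sharpened by the scalar beta."""
    mem_norm = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    key_norm = key / (np.linalg.norm(key) + 1e-8)
    scores = beta * (mem_norm @ key_norm)
    e = np.exp(scores - scores.max())    # numerically stable softmax
    return e / e.sum()

# Toy memory of 4 locations, each a 3-dimensional slot (hypothetical sizes).
M = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
w = content_weighting(M, key=np.array([1.0, 0.0, 0.0]), beta=10.0)
read_vector = w @ M   # differentiable weighted read over all locations
```

Because the read is a weighted sum over every location, the whole access path stays differentiable and the controller can be trained end to end with gradient descent.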

**Figure 9.** Differentiable Neural Computer (DNC) grafted model. Compared to the previous figure (i.e., Figure 8), only the controller part has been modified. In the original architecture, the controller is a simple neural network or a recurrent neural network. In this architecture, by contrast, the controller is itself a DNC.

**Figure 10.** A diagram showing the high-level functioning of the routing-based Reinforcement Learning architecture implemented on the Deep-Q Network (DQN) model. The first figure on the left shows a frame of the game, cropped and pre-processed. A convolutional network takes the game frame as input and generates the capsules of the first level. Those capsules pass through the capsule network to the final layer: the action capsules. The activated capsule is forwarded to the game environment to select the next action and to the inverse convolutional network. Starting from the activated capsule, the inverse convolutional network reconstructs the prediction of the transition following the chosen action. The last figure on the right represents a transition between one frame and the next, obtained by subtracting the previous frame from the game frame reached following the action.
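The transition that the inverse convolutional network learns to reconstruct is simply a frame difference. A minimal sketch with toy 4×4 grayscale frames (hypothetical values):

```python
import numpy as np

# Toy pre-processed grayscale frames (4x4 for illustration only).
prev_frame = np.zeros((4, 4))
next_frame = np.zeros((4, 4))
next_frame[1, 2] = 1.0   # e.g., the object has moved to this pixel

# The transition is the target the inverse network tries to predict:
# the frame reached after the action minus the previous frame.
transition = next_frame - prev_frame
```

Predicting the difference rather than the full next frame keeps the reconstruction target sparse, since unchanged background pixels cancel out.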

**Figure 11.** (**a**) A graph showing the duration of the episodes in steps for the CartPole-v1 environment during training for the first experiment on routing applied to Reinforcement Learning. (**b**) On the top left, the current transition; in the center, the prediction of the transition; on the right, the four capsules of the first layer, each consisting of four values. The first row displays the current transition, the second row the previous one. The numbers above the figure in the last column on the right depend on our choice to adopt mini-batch normalization as an optimization method; consequently, the extraction of game frames for training the model occurs randomly. Specifically, the first number (on the left) is the randomly assigned index of the transition within the batch. The second number (on the right) denotes the game frame at which the reconstruction is performed. The direction of the arrow connecting the two numbers indicates whether the action chosen by the model is left or right, these being the only two possible actions in the CartPole-v1 game.

**Figure 12.** A different representation for transitions. (**b**) On the top left, the current transition; in the center, the prediction of the transition; on the right, the four capsules of the first layer, each consisting of four values. The first row displays the current transition, the second row the previous one. The numbers above the figure in the last column on the right depend on our choice to adopt mini-batch normalization as an optimization method; consequently, the extraction of game frames for training the model occurs randomly. Specifically, the first number (on the left) is the randomly assigned index of the transition within the batch. The second number (on the right) denotes the game frame at which the reconstruction is performed. The direction of the arrow connecting the two numbers indicates whether the action chosen by the model is left or right, these being the only two possible actions in the CartPole-v1 game.

**Figure 13.** A diagram showing the high-level functioning of the first Reinforcement Learning architecture with routing implemented on the Advantage Actor-Critic (A2C) algorithm. On the left, a frame of the game with, below it, a four-component vector that encodes the current state of the game. A neural network takes the current state as input and generates the capsules of the first level. Those capsules pass through the capsule network to the final layer: the actor-critic capsules. The A2C algorithm, as well as the inverse convolutional network, is applied to the last capsule. Starting from the actor-critic capsule, the inverse convolutional network reconstructs the prediction of the transition following the chosen action. The transition is defined as the difference between the vector of the current game state and the previous one.
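The quantities that the A2C algorithm computes at the final capsule can be sketched for a single step. The reward, value estimates, and action probability below are toy numbers, not measured values:

```python
import numpy as np

# One-step Advantage Actor-Critic (A2C) quantities with toy numbers.
gamma = 0.99
reward = 1.0                 # CartPole yields +1 for every surviving step
v_s, v_next = 10.0, 10.5     # critic estimates V(s) and V(s')
log_prob = np.log(0.6)       # log-probability of the action the actor chose

# The one-step TD error serves as the advantage estimate.
advantage = reward + gamma * v_next - v_s

actor_loss = -log_prob * advantage   # policy-gradient term for the actor
critic_loss = advantage ** 2         # squared TD error regresses the critic
```

In the architecture of the figure, the reconstruction loss of the inverse network would be added to these two terms, so the capsules are trained jointly for control and for transition prediction.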

**Figure 14.** A diagram showing the high-level functioning of the second Reinforcement Learning architecture with routing implemented on the Advantage Actor-Critic (A2C) algorithm. This model is similar to the previous one, shown in Figure 13. However, in this case, we have introduced an additional neural network (indicated as Actor in the figure), which returns the chosen action starting from the capsule of the final layer.

**Figure 15.** The experimental path we followed, along with some empirical evaluations of the tested solutions in terms of Interpretability, Efficiency, Structural Invariance, Scale Invariance, and Scalability.

**Table 1.** Details of the empirical tests performed on three game environments (CartPole-v0, Pong-v0, and MountainCar-v0) [61].

Game | Algorithm | Vision | Capsule Routing | Optimization Algorithm | Learning Rate | Number of Episodes | Solved |
---|---|---|---|---|---|---|---|
CartPole-v0 | Actor-Critic (AC) | No | No | Adam | 3.00 $\times {10}^{-2}$ | 710 | Yes |
CartPole-v0 | AC | No | Policy | RAdam | 5.00 $\times {10}^{-3}$ | 600 | No |
CartPole-v0 | AC | No | Policy | RAdam | 5.00 $\times {10}^{-4}$ | 600 | No |
CartPole-v0 | AC | No | Policy | Adam | 5.00 $\times {10}^{-3}$ | 600 | No |
CartPole-v0 | AC | No | Policy | Adam | 1.30 $\times {10}^{-3}$ | 1420 | Yes |
CartPole-v0 | AC | No | Policy | Adam | 2.30 $\times {10}^{-3}$ | 1000 | Yes |
CartPole-v0 | AC | No | Policy | Adam | 2.00 $\times {10}^{-3}$ | 710 | Yes |
CartPole-v0 | AC with corrections | No | Policy | Adam | 2.00 $\times {10}^{-3}$ | 710 | Yes |
CartPole-v0 | AC with memory | No | No | Adam | 3.00 $\times {10}^{-2}$ | 1000 | Yes |
CartPole-v0 | AC with memory | No | No | Adam | 1.30 $\times {10}^{-3}$ | 820 | Yes |
CartPole-v0 | AC | Yes | Vision | RAdam | 5.00 $\times {10}^{-4}$ | 1000 | No |
CartPole-v0 | AC | Yes | Vision | RAdam | 1.30 $\times {10}^{-4}$ | 1200 | No |
CartPole-v0 | AC | Yes | Vision | RAdam | 1.30 $\times {10}^{-3}$ | 1000 | No |
CartPole-v0 | AC without policy | No | No | Adam | 3.00 $\times {10}^{-2}$ | 2220 | No |
CartPole-v0 | AC without policy | No | No | Adam | 3.00 $\times {10}^{-3}$ | 1190 | Yes |
CartPole-v0 | AC without policy | No | No | Adam | 5.00 $\times {10}^{-3}$ | 1000 | Yes |
CartPole-v0 | AC without policy | No | No | Adam | 7.00 $\times {10}^{-3}$ | 550 | Yes |
CartPole-v0 | AC | Yes | No | Adam | 3.00 $\times {10}^{-2}$ | 1200 | No |
CartPole-v0 | AC | Yes | No | RAdam | 5.00 $\times {10}^{-3}$ | 3090 | No |
CartPole-v0 | AC | Yes | No | RAdam | 1.30 $\times {10}^{-3}$ | 940 | No |
CartPole-v0 | AC | Yes | No | RAdam | 5.00 $\times {10}^{-5}$ | 1050 | No |
CartPole-v0 | AC | Yes | No | RAdam | 2.00 $\times {10}^{-4}$ | 920 | No |
CartPole-v0 | AC | Yes | No | RMSProp | 2.00 $\times {10}^{-4}$ | 1840 | No |
CartPole-v0 | AC | Yes | No | RMSProp | 5.00 $\times {10}^{-5}$ | 1910 | Yes |
CartPole-v0 | Deep-Q Network | Yes | Vision+Policy | RAdam | 5.00 $\times {10}^{-4}$ | 350 | No |
CartPole-v0 | Deep-Q Network | Yes | No | RMSProp | 1.00 $\times {10}^{-2}$ | 1000 | No |
CartPole-v0 | Deep-Q Network | Yes | No | RMSProp | 1.00 $\times {10}^{-2}$ | 1000 | No |
CartPole-v0 | Deep-Q Network | Yes | No | RMSProp | 1.00 $\times {10}^{-2}$ | 10,000 | No |
Pong-v0 | AC | Yes | No | Adam | 2.00 $\times {10}^{-3}$ | 250 | No |
MountainCar-v0 | AC | No | No | Adam | 2.00 $\times {10}^{-3}$ | 14,000 | No |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Vaccaro, L.; Sansonetti, G.; Micarelli, A.
An Empirical Review of Automated Machine Learning. *Computers* **2021**, *10*, 11.
https://doi.org/10.3390/computers10010011
