Deep Reinforcement Learning Algorithms in Intelligent Infrastructure

Abstract: Intelligent infrastructure, including smart cities and intelligent buildings, must learn and adapt to the variable needs and requirements of users, owners and operators in order to be future proof and to provide a return on investment based on Operational Expenditure (OPEX) and Capital Expenditure (CAPEX). To address this challenge, this article presents a biological algorithm based on neural networks and deep reinforcement learning that enables infrastructure to be intelligent by making predictions about its different variables. In addition, the proposed method makes decisions based on real time data. Intelligent infrastructure must be able to proactively monitor, protect and repair itself: this includes independent components and assets working the same way any autonomous biological organisms would. Neurons of artificial neural networks are associated with a prediction or decision layer based on a deep reinforcement learning algorithm that takes into consideration all of its previous learning. The proposed method was validated against an intelligent infrastructure dataset with outstanding results: the intelligent infrastructure was able to learn, predict and adapt to its variables, and components could make relevant decisions autonomously, emulating a living biological organism in which data flow exhaustively.


Introduction
Artificial intelligence (AI) will enable infrastructure to be human as well as intelligent: AI will become the brain of the infrastructure, which will monitor, operate and manage its different assets, components and functions. This coordination and integration between simple parts and elements will enable infrastructure to reach a higher level of complexity similar to that of biological organisms. The intelligence of infrastructure, unlike in humans, will be decentralized to reduce single points of failure: it will be hosted in a distributed configuration between local edge servers and the external cloud in data centres. Embedded electronic devices will give infrastructure the sensory ability to feel environmental conditions, user occupancy, energy usage, material or component stress and asset status: this ability to feel, sense and interact with its users and environment will make infrastructure more human [1]. Data will be transmitted between sensing devices and the infrastructure brain using a combination of wireless or wired methods and different transmission protocols chosen according to the sensor network characteristics, such as the number of sensors, distance, bandwidth, power and data; radio frequency constraints such as path obstacles and channel interference will also be considered.
There is already a gradual digitalization of infrastructure, or "smart concrete", that enables the monitoring of infrastructure to avoid its deterioration and to make optimum investment decisions [2]. Safe infrastructure requires data to evaluate its short and long term performance, and wireless sensor networks enable infrastructure monitoring without the Capital Expenditure (CAPEX) of cabling: this provides real time status data, therefore enabling proactive maintenance rather than a reactive approach, which normally costs more. In addition to providing data connectivity, Wi-Fi and Bluetooth low energy (BLE) beacons can be used to collect infrastructure occupancy information [3]: these data are key for space management and usage assessments.
The main component of digital infrastructure, or digital twin, is the Building Information Model (BIM). The BIM consists of two major components: a three dimensional graphical reproduction of the building geometry and a related database in which all data, properties and relations are stored. The value of the BIM generated during the design and construction phase is well documented and can result in an estimated 30% reduction in total construction costs [4]; however, once the building is handed over to its owners, the remaining applications of the BIM are not widely used by the building maintainers and operators. BIM technology adoption is focused on BIM human users and people interaction, rather than on technology, in order to achieve a successful change in management [5] and overcome the resistance of the construction and building industry against transformation (Figure 1).
Infrastructures 2019, 4, 52
The Internet of Things (IoT) is the extension of digital connectivity into the physical devices and components of infrastructure based on numerous protocols and data formats such as Message Queuing Telemetry Transport (MQTT), Constrained Application Protocol (CoAP), Representational State Transfer (REST) and JavaScript Object Notation (JSON), among many others (Figure 2). In the IoT, things are objects of the physical world (physical things) that can be sensed or objects of the information world (virtual things) that can be digitalized: both are capable of being identified, integrated into information and transmitted via wired or wireless sensor communication networks [6]. The IoT has made a significant contribution to building construction, operation and management by enabling data services, providing efficient functionalities and moving toward sustainable development goals [7]; however, the full adoption of the IoT in infrastructure still presents challenges such as the interoperability of protocols, data consistency and the amount of data to be collected, processed and stored. In addition, the integration of a BIM 3D virtual reality model with the real time uninterrupted collection of data from the IoT to control and manage infrastructure [8] provides innovative applications that improve construction and operational efficiencies. The integration methods between BIM tools' Application Program Interfaces (APIs), cloud computing and relational databases for real time applications were created by a query language that uses semantic web technologies, service oriented architecture (SOA) patterns and web services based strategies [9].
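As a minimal sketch of how a sensed quantity might be packaged for transmission, the following Python example serialises one reading as JSON and validates it on receipt. The field names, the sensor identifier and the topic string are hypothetical and chosen only for illustration; no particular IoT platform or MQTT client library is assumed.

```python
import json

# Fields that every reading is expected to carry (an assumption of this sketch).
REQUIRED_FIELDS = {"sensor_id", "quantity", "value", "unit", "timestamp"}

def encode_reading(sensor_id, quantity, value, unit, timestamp):
    """Serialise one sensor reading as a JSON string."""
    reading = {
        "sensor_id": sensor_id,
        "quantity": quantity,
        "value": value,
        "unit": unit,
        "timestamp": timestamp,
    }
    return json.dumps(reading, sort_keys=True)

def decode_reading(payload):
    """Parse a JSON payload and reject it if any expected field is missing."""
    reading = json.loads(payload)
    missing = REQUIRED_FIELDS - set(reading)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return reading

# Hypothetical hierarchical topic of the kind used by publish/subscribe protocols.
topic = "building/floor2/room14/temperature"
payload = encode_reading("temp-014", "temperature", 21.5, "degC", "2019-01-01T08:00:00Z")
reading = decode_reading(payload)
```

Validating the payload at the receiving side is one simple way to address the data consistency challenge mentioned above, since malformed readings are rejected before they are stored.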
A virtual infrastructure represented in augmented reality (AR) will improve the user interaction with intelligent infrastructure for specific applications such as maintenance, training and wayfinding. AR combines 3D point cloud tracking for the accurate combination of indoor location and point of interest information [10] with a BIM dataset, where AR can access the BIM model to directly update its information. In addition to user interaction, intelligent infrastructure also focuses on providing quality of life and a healthy environment for its users: environmental variables such as temperature, humidity and air quality (CO2, chemicals and dust) are also being collected by sensors because poor air quality causes health issues and decreases productivity [11]. Infrastructure and building energy consumption contribute greatly to humanity's global energy usage (50%) and CO2 emissions (30%) [12]. Smart grids and smart meters enable infrastructure users and owners to monitor energy usage in real time and shift demand according to energy price in order to obtain the largest proportion of clean renewable energy [13]: intelligent infrastructure will be able to act as an energy broker by storing and commercializing electricity in fleets of electric vehicles as the final link of the energy distribution chain.
Although technology enables infrastructure to be smarter through the interconnection of users and the provision of services between systems, this open accessibility also increases cybersecurity risks (Figure 3). Securing the intelligent infrastructure cyberspace is key in order to protect safety, prosperity and economic growth [14]: critical infrastructure will be specifically protected, as potential attacks have a wide range of origins and motivations, such as hostile countries, political activism, ransom hacks, intellectual property theft or revenge by disgruntled employees. The main challenge that cybersecurity faces in intelligent infrastructure is its economic cost: despite the risk, cybersecurity is not considered a significant investment area because the cost of full implementation is prohibitive, which undermines the business case and leads to reactive security rather than proactive security [15].
Blockchain technology (Figure 4) can also increase cybersecurity resilience in order to mitigate threats to the availability, integrity, confidentiality, authenticity and accountability of intelligent infrastructure [16]. Blockchain applications can be applied to humans, technology and businesses: they provide privacy, integrity and data confidentiality in transactions via a distributed structure [17] without the need for an intermediary authoriser. Blockchain services for intelligent infrastructure range from data validation, smart meter readings and payments, user authentication and asset integrity [18] to traditional payments, smart contracts and digital transactions of information or money [19].
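The integrity property mentioned above can be illustrated with a hash-linked ledger of smart meter readings. This is only a minimal single-node sketch in Python: it shows how linking each block to the hash of its predecessor makes tampering detectable, but it omits the distribution, consensus and signing that a real blockchain requires. The meter identifier and readings are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder predecessor hash for the first block

def block_hash(index, reading, previous_hash):
    """SHA-256 digest over the block contents, including the predecessor hash."""
    body = json.dumps(
        {"index": index, "reading": reading, "previous_hash": previous_hash},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def append_block(chain, reading):
    """Append a reading to the chain, linking it to the last block."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "reading": reading,
        "previous_hash": previous_hash,
        "hash": block_hash(index, reading, previous_hash),
    })

def is_valid(chain):
    """Recompute every hash and check every link; any edit breaks the chain."""
    previous_hash = GENESIS_HASH
    for index, block in enumerate(chain):
        if block["index"] != index or block["previous_hash"] != previous_hash:
            return False
        if block["hash"] != block_hash(index, block["reading"], previous_hash):
            return False
        previous_hash = block["hash"]
    return True

# Hypothetical smart meter readings in kWh.
ledger = []
for kwh in (1.2, 0.9, 1.4):
    append_block(ledger, {"meter": "meter-01", "kwh": kwh})
```

Changing any stored reading after the fact causes `is_valid` to return `False`, because the recomputed hash no longer matches the stored one.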
The intelligence of infrastructure will be founded on artificial intelligence (AI), emulating the way biological organisms learn from experiences, adapt to the external environment, transmit information and finally evolve across mutations. AI is based on the brain structure and neural configuration: the brain acquires a large amount of information from the senses, analyses and processes the data via different learned functions and finally makes judgments and takes decisions, where the specialization of neuron clusters occurs as a result of their adaptation to learning tasks [20].
Intelligent algorithms can be divided into three main classes. Supervised learning finds a function matching given input-output pairs and therefore requires a training set. In unsupervised learning, by contrast, only inputs are given, and the cost function to be minimized depends on the task to be modelled and on a priori assumptions such as implicit properties of the model, its parameters or the observed variables. Finally, inputs are usually not given in reinforcement learning, as they are generated by an agent's interactions with the environment: at each point in time, the agent performs an action and the environment generates an observation with an instantaneous associated cost according to some dynamics (Figure 5). Different AI learning methods have diverse learning and computational properties and are therefore applied to different models. Reinforcement learning is used for fast decisions in unsupervised scenarios; deep learning clusters based on gradient descent are most suited for identity and memory, although they are computationally expensive; and finally, genetic algorithms transmit information to future generations [21]. In addition, artificial intelligence has been used to make management decisions in complex management structures [22].
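The agent and environment interaction described above for reinforcement learning can be sketched with tabular Q-learning that minimizes an instantaneous cost. This Python example is a generic illustration and not the deep reinforcement learning model proposed in this article: the corridor environment, the cost values and the hyperparameters are all assumptions made for the sketch.

```python
import random

N_STATES = 5        # positions 0..4 in a hypothetical corridor; position 4 is the goal
ACTIONS = (-1, +1)  # move left or move right

def step(state, action):
    """Environment dynamics: return (next_state, cost, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    cost = 0.0 if done else 1.0  # every step that does not reach the goal costs 1
    return next_state, cost, done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning where q[s][a] estimates the discounted future cost."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))  # explore
            else:
                a = min(range(len(ACTIONS)), key=lambda i: q[state][i])  # exploit
            next_state, cost, done = step(state, ACTIONS[a])
            # The agent only sees the observation and its cost, never the dynamics.
            target = cost if done else cost + gamma * min(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
            if done:
                break
    return q

q_table = train()
# Greedy policy for the non-terminal states: the action with the lowest estimated cost.
policy = [
    ACTIONS[min(range(len(ACTIONS)), key=lambda i: q_table[s][i])]
    for s in range(N_STATES - 1)
]
```

The training data are not supplied in advance: each update uses an observation and cost produced by the agent's own action, which is the distinction from supervised learning drawn in the text.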

Article Proposal
This article proposes the implementation of artificial intelligence in intelligent infrastructure with an extensive literature review that also covers the building information model, the Internet of Things and deep reinforcement learning. In addition, this article presents a deep reinforcement learning model based on a random neural network that updates the neural network weights, taking into consideration its entire previous learning, rather than just the latest information, therefore including time and memory. The proposed method was validated in a public research intelligent building dataset with successful results.

Article Structure
Section 2 of this article presents a research background that covers artificial intelligence in infrastructure; infrastructure, data and the Building Information Model; the Internet of Things with related cybersecurity; and deep reinforcement learning. Section 3 presents a mathematical definition of the deep reinforcement learning model, whereas Section 4 describes the proposed method for intelligent infrastructure. The validation of the proposed method is shown in Section 5. Finally, a discussion and conclusions are shared in Sections 6 and 7, respectively.

Artificial Intelligence in Infrastructure
Several machine learning regression methods that develop a predictive model have been examined and applied to predict the hourly full load electrical power output of a combined cycle power plant [23]. Its base load operation is influenced by four main parameters: ambient temperature, atmospheric pressure, relative humidity and exhaust steam pressure. These parameters are used as the input variables of the dataset, and the electrical power output is the target variable. The prediction of building energy usage has an important role in building energy management and conservation, as it assists in the evaluation of the building energy efficiency, the delivery of building commissioning and the detection and diagnosis of building system faults [24]: an artificial intelligence based approach to building energy use prediction relies on historical data and methods such as multiple linear regression, artificial neural networks and support vector regression. A statistical machine learning framework studies the effects of eight input variables (relative compactness, surface area, wall area, roof area, overall height, orientation, glazing area and glazing area distribution) on two output variables, the heating load and cooling load of residential buildings, based on a classical linear regression approach [25]; the model is compared to a state of the art nonlinear nonparametric method based on random forests.
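As an illustration of the simplest of the regression methods mentioned above, the following Python sketch fits a multiple linear regression with four inputs (ambient temperature, atmospheric pressure, relative humidity and exhaust steam pressure) to a power output target. The samples and the "true" coefficients are synthetic values invented for this example; they are not taken from the dataset in [23], and the helper names are hypothetical.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_linear(features, targets):
    """Ordinary least squares on centred data; returns [intercept, w1, ..., wk]."""
    n, k = len(features), len(features[0])
    means = [sum(f[j] for f in features) / n for j in range(k)]
    t_mean = sum(targets) / n
    centred = [[f[j] - means[j] for j in range(k)] for f in features]
    xtx = [[sum(r[i] * r[j] for r in centred) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * (t - t_mean) for r, t in zip(centred, targets)) for i in range(k)]
    weights = solve(xtx, xty)
    intercept = t_mean - sum(w * mu for w, mu in zip(weights, means))
    return [intercept] + weights

def predict(coefficients, feature):
    """Linear prediction: intercept plus the weighted sum of the inputs."""
    return coefficients[0] + sum(w * x for w, x in zip(coefficients[1:], feature))

# Synthetic, noise-free data; coefficient values are purely illustrative.
rng = random.Random(0)
TRUE = [450.0, -1.5, 0.05, -0.1, -0.3]
samples = [
    (rng.uniform(5, 35), rng.uniform(990, 1030), rng.uniform(30, 100), rng.uniform(25, 80))
    for _ in range(50)
]
outputs = [predict(TRUE, s) for s in samples]
fitted = fit_linear(samples, outputs)
```

Centring the inputs before solving the normal equations keeps the system well conditioned even though the pressure values are large compared with their spread.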
The forecast of energy consumption in homes is an important activity for the smart grid, and this prediction is very dependent on inhabitants' behaviour [26]: a stochastic prediction method segments data based on patterns in energy consumption, aggregating them via the k means clustering algorithm. Data driven predictive models for the energy use of appliances include measurements from temperature and humidity sensors in a wireless network, weather data from a nearby airport station and recorded energy use [27]: data are filtered to remove nonpredictive parameters and enable feature ranking, and four statistical models are trained and evaluated with repeated cross validation (multiple linear regression, support vector machine with radial kernel, random forest and gradient boosting machines). Machine learning and artificial intelligence have been used in emergency navigation in a cloud environment to reduce device energy consumption [28] (without static ad hoc networks such as wireless sensor infrastructure) [29].
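The k means step mentioned above can be sketched as follows. This Python example only groups daily consumption profiles by shape; it does not reproduce the stochastic prediction method of [26]. The profiles (morning, midday, evening and night consumption in kWh) are hypothetical, and the deterministic farthest-point initialisation is a design choice of this sketch rather than part of the standard algorithm.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, iterations=50):
    """Lloyd's algorithm; returns (centroids, labels)."""
    # Farthest-point initialisation: start from the first point, then repeatedly
    # add the point that is farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Hypothetical daily profiles: (morning, midday, evening, night) consumption in kWh.
profiles = [
    (0.2, 0.1, 1.5, 0.3), (0.3, 0.2, 1.4, 0.2), (0.2, 0.2, 1.6, 0.3),  # evening peak
    (1.1, 1.3, 0.4, 0.2), (1.0, 1.2, 0.5, 0.3), (1.2, 1.4, 0.3, 0.2),  # daytime peak
]
centroids, labels = k_means(profiles, k=2)
```

Each resulting centroid is an average profile for one behaviour pattern, which is the kind of aggregate a per-cluster forecasting model could then be fitted to.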

Infrastructure, Data and the Building Information Model
The building information model (BIM) was developed for effective collaboration between design and construction project participants through the building life cycle; however, in order for the BIM to be effective, it also requires additional layers of integration on a functional level (BIM modeler and checker), information management (BIM server) and process support (BIM collaboration) [30]. Despite its major technical advantages, the BIM has not been fully adopted, and its definitive benefits have not been fully capitalized by industry stakeholders during the construction and handover stage: this appears to be linked to risks and challenges such as intellectual property, user skills, model reuse and cybersecurity, which are potentially limiting its effectiveness [31]. In addition, the integration of BIM blurs the level of responsibility between different team members as well as the ownership of data and design. Empirical insights into the implementation and collaborative nature of BIM construction projects are divided into (1) Information Technology capacity, (2) technology management, (3) attitude and behaviour, (4) role taking, (5) trust, (6) communication, (7) leadership and (8) learning and experience [32], where the taxonomy of BIM affects three dimensions: technology, people and process.
BIM applications in green infrastructure, or the green BIM triangle, cover specific project phases (from design and construction to operation and renovation or demolition); green attribute analysis, such as energy consumption, emissions, lighting or ventilation, material and waste; and finally, BIM attributes in supporting green building assessments, which include database integration, document management, analysis and simulation, and visualization [33]. Traditional risk management strategies can be complemented with BIM technology to manage hazards through automatic rule checking, knowledge based systems and reactive and proactive IP based safety systems [34]: BIM could not only be utilized to support project development processes as a systematic risk management tool, but it could also provide the core data generator and platform that allows other BIM based tools to perform further risk analysis. However, due to existing technical limitations and the lack of "human factor" testing, BIM based risk management has not been commonly used in real environments. A BIM model can be integrated with other IoT technologies to improve user interaction and interfaces [35]: examples include an ultrawide band (UWB) based indoor positioning system and an inertial measurement unit (IMU) that retrieves user contextual information with respect to the built environment for the control of electric appliances in a smart home. The BIM can also be applied with other manufacturing and production techniques and reverse engineering [36], which include 3D laser scanning, virtual reality, 3D printing and prefabrication, for a better understanding of the design and construction process and for tools with enhanced organization and management quality that reduce defects and rework.
There are also additions and extensions to the traditional BIM model: BIM itself is a purpose built, product centric information database that lacks domain semantics. An ontology based semantic approach that analyses construction workface planning is focused on extracting quantity information from a BIM design model. This method allows user semantic queries using a domain vocabulary that exploits the building product ontology formalized from construction perspectives [37]: as such, information relevant to construction practitioners can be readily extracted and visualized in 3D in order to serve application needs in the construction field.
A 4D BIM is defined as a 3D model that includes time, and it can be used as a framework to automatically analyse, generate and visualize the evacuation paths of multiple teams considering the construction activities and site conditions of the specific project schedule [38]: the prototype enables users to define parameters for pathfinding, such as workspaces, material storage areas and temporary structures, to automatically identify the accessible evacuation paths. A 6D BIM model that also included time, a cost schedule and CO2 emissions calculations was applied in a railway station, King's Cross in London [39]: the model provided an effective plan and design that adjusted to the economic and environmental framework and requirements of a construction project while operating in tandem with maintenance and uninterrupted railway operations.
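The idea of schedule dependent evacuation pathfinding can be illustrated with a breadth-first search on a small grid, where the set of blocked cells changes with the construction day. This Python sketch is an assumption-laden simplification and not the prototype of [38]: the site is reduced to a 3 by 3 grid, and the blocked cells standing in for storage areas and temporary structures are hypothetical.

```python
from collections import deque

def shortest_path(width, height, blocked, start, exits):
    """Breadth-first search on a 4-connected grid; returns the list of cells or None."""
    queue = deque([start])
    parent = {start: None}  # also serves as the visited set
    while queue:
        cell = queue.popleft()
        if cell in exits:
            # Walk the parent links back to the start to rebuild the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nxt[0] < width and 0 <= nxt[1] < height
            if inside and nxt not in blocked and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None  # no accessible evacuation path

# Hypothetical schedule: cells occupied by storage or temporary structures on each day.
blocked_by_day = {
    1: {(1, 0), (1, 1)},
    2: {(1, 0), (1, 1), (1, 2)},
}
path_day_1 = shortest_path(3, 3, blocked_by_day[1], start=(0, 0), exits={(2, 0)})
path_day_2 = shortest_path(3, 3, blocked_by_day[2], start=(0, 0), exits={(2, 0)})
```

On day 1 the search finds a detour around the obstruction, whereas on day 2 the additional blocked cell cuts the site in two and no path exists, which is the kind of condition a 4D analysis is meant to flag before the schedule is executed.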

The Internet of Things and Cybersecurity
The IoT enables comprehensive connectivity between devices; however, this benefit also intrinsically increases cybersecurity risks, as cyber attackers are provided with expanded network access and additional digital targets. The evolution of IoT technology started with "machine to machine" through the connection of machines and devices, and then included "Interconnections of Things" that connect any physical or virtual object and finally a "Web of Things" that enables collaboration between people and objects [40]. As a consequence of its fast development, the IoT was initially designed without appropriate consideration of the security challenges involved [41]: various vulnerabilities have been detected that will keep the IoT as a technology with risks, and as a result, numerous attacks on the IoT were invented before its actual commercial implementation [42]. As security will be a fundamental enabling factor of most IoT applications, mechanisms must also be designed to protect communications enabled by such technologies [43]. The IoT is formed of three layers (sensor, transportation and application) that are similar to traditional networks with equivalent security issues and integration challenges [44]. Because physical, virtual and user private information is captured, transmitted and shared by the IoT sensors [45], the enforcement of security and privacy policies will also consider and implement the cybersecurity aspects of data confidentiality and authentication, access control within the IoT network, identity management, privacy and trust between users and things. The dynamic IoT is formed by heterogeneous technologies that provide innovative services in various application domains, which must meet flexible security and privacy requirements [46]: traditional security countermeasures cannot be directly applied due to the different standards and communication protocols and the scalability issues caused by the high number of interconnected devices. An important challenge for supporting diverse multimedia applications in the IoT is the security heterogeneity of wired and wireless sensor and transmission networks, which requires a balance between flexibility and efficiency [47]. A secure and safe Internet of Things (SerIoT) was proposed to improve the information and physical security of different operational IoT application platforms in a holistic and cross layered manner [48]: the SerIoT covers areas such as mobile telephony, networked health systems, the Internet of Things, smart cities, smart transportation systems, supply chains and industrial informatics [49].
The IoT enables the integration of intelligent behaviour and services into the surrounding environments, such as infrastructure and buildings [50]: a smart building management system based on knowledge databases, machine learning, big data engines and data storage extracts information from the building, its users or managers; adapts to the real environment; and finally takes action on building systems by applying previously learned strategies. The commoditization of smart building technology based on the IoT will redefine the way we work and live in the future. The promise of intelligent infrastructure extends far beyond energy efficiency or house comfort services, and the IoT will enable radical changes similar to the ones brought by the internet [51]: cloud integration is democratizing the IoT in intelligent infrastructure to include more complex functionality at a reduced cost (however, it also introduces additional issues, such as cybersecurity and privacy, that will be addressed). An IoT platform based implementation for design automation in smart building systems reuses hardware and software on shared infrastructure to optimize design performance [52]: the methodology consists of a functional design layer with virtual device platforms, function templates and virtual devices; a module design layer with module platforms, virtual device templates, sensing modules and data analytics modules; and finally an implementation platform with building operation systems APIs and programme code run time. Digitalization will merge the intelligence of infrastructure, buildings and transport systems because technology and solutions based on IoT and AI are shared [53] to cover similar functionalities such as route optimization, parking management, accident detection or fare collection.

Deep Reinforcement Learning
Deep learning enables reinforcement learning to scale to decision making problems that were previously unmanageable. An algorithm called double deep Q network (double DQN) generalizes to arbitrary function approximation [54]: the algorithm includes deep neural networks and reduces overestimations by decomposing the max operation in the target into action selection and action evaluation. Although the DQN solves problems with high dimensional observation spaces, it can only manage discrete and low dimensional action spaces. The DQN depends on finding the action that maximizes the action value function, which in the continuous valued case requires an iterative optimization process at each step [55]: in order to overcome this issue, an algorithm based on the deterministic policy gradient operates over continuous action spaces.
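The decomposition of the max operation into action selection and action evaluation can be illustrated with a minimal sketch. The function names, the batch layout and the discount factor `gamma` are assumptions made for illustration; they are not taken from [54].

```python
import numpy as np

def dqn_targets(rewards, next_q_target, dones, gamma=0.99):
    """Standard DQN target: the target network both selects and evaluates."""
    return rewards + gamma * (1.0 - dones) * np.max(next_q_target, axis=1)

def double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Double DQN target: selection and evaluation use different networks."""
    # Action selection: greedy action according to the online network.
    best_actions = np.argmax(next_q_online, axis=1)
    # Action evaluation: value of that action according to the target network.
    rows = np.arange(len(best_actions))
    evaluated = next_q_target[rows, best_actions]
    # Terminal transitions (dones == 1) do not bootstrap from the next state.
    return rewards + gamma * (1.0 - dones) * evaluated

if __name__ == "__main__":
    rewards = np.array([1.0, 0.0])
    dones = np.array([0.0, 1.0])
    next_q_online = np.array([[1.0, 2.0], [3.0, 0.0]])
    next_q_target = np.array([[5.0, 0.5], [1.0, 4.0]])
    print(dqn_targets(rewards, next_q_target, dones, gamma=0.9))
    print(double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.9))
```

In this toy batch the standard target bootstraps from the largest target network value, whereas the double DQN target bootstraps from the value of the action preferred by the online network, which is never larger.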
A framework for deep reinforcement learning asynchronously executes multiple agents in parallel on multiple instances of the environment [56]: this parallelism decorrelates the agents' data into a more stationary process, and the framework uses gradient descent to optimize deep neural network controllers. A neural network architecture for model free reinforcement learning consists of a duelling network that represents two separate estimators, one for the state value function and the other for the state dependent action advantage function [57]: the two streams are combined via a special aggregating layer to produce an estimate of the state action value function. A benchmark covers continuous simple actions, high state and action dimensionality control, tasks with partial observations and tasks with hierarchical structures [58]. Challenges posed by reproducibility, experimental techniques and reporting procedures in deep reinforcement learning methods have been investigated [59]: reported metrics and results are compared against common baselines in order to suggest guidelines that make future results more reproducible. Deep reinforcement learning has also been applied to resource management problems in systems and networking [60]: appropriate solutions to these decision making tasks depend on understanding the workload and the environment through experience.
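The aggregating layer of the duelling architecture can be sketched as follows. This is a minimal illustration that assumes the mean subtracted combination Q(s, a) = V(s) + A(s, a) − mean over a' of A(s, a'); the function name and array shapes are hypothetical.

```python
import numpy as np

def duelling_q_values(state_values, advantages):
    """Combine the value stream and the advantage stream into Q values.

    state_values: shape (batch,), one scalar V(s) per state.
    advantages:   shape (batch, n_actions), one A(s, a) per action.
    """
    values = np.asarray(state_values, dtype=float).reshape(-1, 1)
    adv = np.asarray(advantages, dtype=float)
    # Subtracting the mean advantage keeps the two streams identifiable.
    return values + adv - adv.mean(axis=1, keepdims=True)

if __name__ == "__main__":
    print(duelling_q_values([2.0], [[1.0, 0.0, -1.0]]))
```

Because the mean advantage is removed, the average of the resulting Q values over the actions equals the state value.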
The end to end learning of communication protocols in complex environments presents issues of partial observability, where multiple agents sense and act with the goal of maximizing their shared utility. These drawbacks are addressed by two approaches based on centralized learning but decentralized execution [61]: reinforced interagent learning applies deep Q learning, and differentiable interagent learning exploits the fact that, during learning, agents can back propagate error derivatives through noisy communication channels. Imagination augmented agents (I2As) is a novel architecture for deep reinforcement learning that combines model free and model based aspects [62]: model based reinforcement learning and planning methods prescribe how a model should be used to arrive at a policy, whereas I2As learn to interpret predictions from a learned environment model to construct implicit plans in arbitrary ways by using the predictions as additional context in deep policy networks.
The useful interaction between sophisticated reinforcement learning (RL) systems and real world environments requires the communication of complex goals to these systems. Goals can be defined in terms of nonexpert human preferences between pairs of trajectory segments to solve complex RL tasks that do not require access to the reward function [63]: the model can successfully train complex novel behaviours with one hour of validation time, which reduces the cost of human supervision. Learning goal directed behaviour in environments with sparse feedback is a major challenge for reinforcement learning algorithms, and one of the key difficulties is insufficient exploration that results in an agent being unable to learn robust policies: intrinsically motivated agents can explore new behaviour for their own sake rather than directly solving external goals and tasks posed by the environment. A hierarchical DQN (h-DQN) is a framework that integrates hierarchical action value functions that operate at different temporal scales with intrinsically motivated, goal driven deep reinforcement learning [64]: a top level Q value function learns a policy over intrinsic goals, while a lower level function learns a policy over atomic actions to satisfy flexible goal specifications such as functions over entities and relations.
To use reinforcement learning successfully in situations close to real world complexity, agents must derive efficient representations of the environment from high dimensional sensory inputs and use these to transform past experiences into typical models that can be adapted to new situations [65]: humans and other animals solve this problem through a combination of reinforcement learning and hierarchical sensory processing systems. A deep Q network can successfully learn policies directly from high dimensional sensory inputs using end to end reinforcement learning. A method for learning energy based policies for continuous states and actions applies soft Q learning, which learns maximum entropy policies and expresses the optimal policy via a Boltzmann distribution [66]: amortized Stein variational gradient descent learns a stochastic sampling network that approximates samples from this distribution in order to improve exploration and the compositionality that allows skills to be transferred between tasks, with a connection to actor critic methods.
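A policy expressed via a Boltzmann distribution can be illustrated for the simpler case of a finite action set, where the probabilities can be computed exactly. This sketch is a discrete analogue only and is not the continuous action method of [66]; the function name and the `temperature` parameter are assumptions.

```python
import numpy as np

def boltzmann_policy(q_values, temperature=1.0):
    """Return action probabilities proportional to exp(Q / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    q = np.asarray(q_values, dtype=float)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = np.exp((q - q.max()) / temperature)
    return weights / weights.sum()

if __name__ == "__main__":
    q_values = [1.0, 2.0, 0.5]
    print(boltzmann_policy(q_values, temperature=1.0))   # soft preference
    print(boltzmann_policy(q_values, temperature=0.05))  # close to greedy
```

A high temperature spreads probability across actions and therefore encourages exploration, while a low temperature concentrates it on the action with the largest Q value.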
Autonomous driving is challenging to model as a supervised learning problem due to its complex interactions with the environment, including other vehicles, pedestrians and roadworks; a framework for autonomous driving based on deep reinforcement learning addresses this [67]: the model incorporates recurrent neural networks for information integration in order to enable vehicles to handle partially observable scenarios, and it also integrates recent work on attention models that focus on relevant information to reduce the computational complexity for deployment in embedded hardware. There have been six extensions to the DQN algorithm (double Q learning, prioritized replay, duelling networks, multistep learning, distributional reinforcement learning and noisy nets); however, it is unclear which of these extensions are complementary and can be successfully combined [68], and an integrated agent called Rainbow combines them to assess incremental performance with several experiments. Two less often addressed issues of deep reinforcement learning are the lack of a generalization capability for new target goals and data inefficiency, as it requires many costly iterations of trial and error to converge, which makes real world scenario applications impractical [69]. To address the first issue, an actor critic model whose policy is a function of the goal as well as the current state allows for a more efficient generalization process. To address the second issue, a framework provides an environment with high quality 3D scenes and a physics engine that enables agents to take actions and interact with objects in order to efficiently collect a huge number of end to end trainable samples that do not need feature engineering, feature matching between frames or 3D reconstruction of the environment. An active detection and class specific model locates objects in scenes, which enables an agent to focus attention on candidate regions for identifying the correct location of a target object [70]: the agent learns to deform a bounding box using simple transformation actions with the goal of determining the most specific location of target objects, which follows top-down reasoning and deep reinforcement learning.
Deep reinforcement learning enables autonomous robots to learn large collections of behavioural skills with minimal human intervention; however, robotic applications of reinforcement learning often compromise the autonomy of the learning process in favour of achieving training times that are practical for real physical systems [71]. A deep reinforcement learning algorithm based on off policy training of deep Q functions can scale to complex 3D manipulation tasks and can learn deep neural network policies efficiently enough to train real physical robots. Typically, deep reinforcement learning methods only utilize visual input for training, although an innovative method augments these models to exploit 3D feature information during the training phase involving partially observable states [72]: the model is trained to simultaneously learn these features and minimise a Q learning objective in order to improve the training speed and performance of the agent.

Deep Reinforcement Learning Model
In a reinforcement learning model [73], agents interact with the environment via observations (O) and actions (A). At each interaction step $t$, the agent receives as input some indication of the current state $s_t$ of the environment. Then the agent selects an action $a_t$ with a probability $p_t$ of generating output. This action changes the state $s_t$ of the environment to $s_{t+1}$, and the value of this state evolution is transmitted to the agent through a scalar reinforcement signal or reward $r_{t+1}$ associated with the transition $(s_t, a_t, s_{t+1})$. The agent chooses actions over time that tend to increase the long term sum of values of the reinforcement signal by trial and error guided by a reinforcement algorithm. Basic reinforcement learning is modelled as a Markov decision process (Figure 6).

In the deep reinforcement learning model, neurons decide whether the next reward trend is upwards, downwards or equal (Figure 7). In addition, as defined at the end of this section, the proposed model also includes a predictor neuron to forecast the value of future rewards.

Infrastructures 2019, 4, 52 10 of 24

The deep reinforcement learning algorithm presented in this section consists of a random neural network (RNN) [74][75][76] with at least as many nodes as the number of decisions to be taken: the network is generated where neurons are numbered $1, \ldots, j, \ldots, n$, and therefore for any decision $i$, there is some neuron $i$. Decisions in this RL algorithm with an RNN are made by selecting the decision $j$ for which the corresponding neuron is the most excited, the one that has the largest value of $q_j$. The state $q_j$ is the probability that neuron $j$ is excited, and these quantities satisfy the following system of nonlinear equations:

$$q_j = \frac{\lambda^{+}(j)}{r(j) + \lambda^{-}(j)}, \qquad \lambda^{+}(j) = \sum_{i=1}^{n} q_i\, r(i)\, p^{+}(i,j) + \Lambda(j), \qquad \lambda^{-}(j) = \sum_{i=1}^{n} q_i\, r(i)\, p^{-}(i,j) + \lambda(j). \tag{1}$$

In altering the reinforcement learning algorithm described in the cognitive packet network [77][78][79][80][81], given that some goal $G$ that the agent has to achieve is a function to be optimized and that reward $R$ is a consequence of interaction with the environment, successive measured values of $R$ are denoted by $R_l$, $l = 1, 2, \ldots$, and these are used to compute a decision threshold $T_l$, where $\alpha$ represents the threshold memory, $0 < \alpha < 1$, and $\beta$ represents the learning gradient. Both variables can be statically assigned or dynamically updated based on external observations. The agent takes the $l$th decision, which corresponds to neuron $j$, and then the $l$th reward $R_l$ is measured and its associated $T_{l-1}$ is calculated; the network weights are then updated for all neurons $i \neq j$. The key innovation of the deep reinforcement learning algorithm presented in this section is that it includes time, or memory, when updating the network weights: it is based on all previous values rather than the previous state (Figure 8).

[Figure 8. Panel labels: Iteration t-2, Iteration t-1, Iteration t, Iteration t+1, Iteration t+2.]

The deep reinforcement algorithm rewards the network weights if the trend decision is correct, where $R_1 > 0$ and $j = 0$ for upwards, $R_1 < 0$ and $j = 1$ for downwards or $R_1 = 0$ and $j = 2$ for equal; otherwise, it penalises the network weights. In both updates, $\delta_t$ is a variable weighting factor that depends on $t$, $0 < \delta_t < 1$, and $l$ is the decision stage. In these updates, $w^{+}_{ij}$ is the rate at which neuron $i$ transmits excitation spikes to neuron $j$, and $w^{-}_{ij}$ is the rate at which neuron $i$ transmits inhibitory spikes to neuron $j$, in both situations when neuron $i$ is excited. $\Lambda_i$ and $\lambda_i$ are the rates of the external excitatory and inhibitory signals, respectively.

In addition to the reinforcement learning algorithm for decision trends, the deep reinforcement algorithm makes predictions on the future values of the manager neurons $q_{t+1}$ based on the previous predictions $q_t$ and the current measurement $q_c$, where $\gamma_t$ is a variable weighting factor that depends on $t$, $0 < \gamma_t < 1$, that can be statically or dynamically assigned, representing the prediction memory.
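The excitation probabilities of a random neural network, $q_j = \lambda^{+}(j) / (r(j) + \lambda^{-}(j))$, are defined implicitly, so a numerical solution is needed before the most excited neuron can be selected. The following minimal sketch solves them by fixed point iteration. It assumes that the weights are given directly as rates $w^{+}_{ij} = r(i)p^{+}(i,j)$ and $w^{-}_{ij} = r(i)p^{-}(i,j)$, that no spikes leave the network (so $r(i)$ is the sum of the outgoing weights of neuron $i$) and that every denominator is positive. The function names are hypothetical, and the weight update rules of the proposed algorithm are not reproduced here.

```python
import numpy as np

def rnn_excitation(w_plus, w_minus, ext_exc, ext_inh, max_iters=500, tol=1e-10):
    """Solve q_j = lambda_plus(j) / (r(j) + lambda_minus(j)) by iteration."""
    w_plus = np.asarray(w_plus, dtype=float)    # w_plus[i, j]: excitatory rate i -> j
    w_minus = np.asarray(w_minus, dtype=float)  # w_minus[i, j]: inhibitory rate i -> j
    ext_exc = np.asarray(ext_exc, dtype=float)  # external excitatory rates
    ext_inh = np.asarray(ext_inh, dtype=float)  # external inhibitory rates
    # Assumption: no departures, so r(i) = sum_j [w_plus(i, j) + w_minus(i, j)].
    rates = w_plus.sum(axis=1) + w_minus.sum(axis=1)
    q = np.zeros(len(ext_exc))
    for _ in range(max_iters):
        lam_plus = q @ w_plus + ext_exc    # total excitatory arrival rate per neuron
        lam_minus = q @ w_minus + ext_inh  # total inhibitory arrival rate per neuron
        # Probabilities are capped at 1 (a saturated neuron).
        new_q = np.minimum(lam_plus / (rates + lam_minus), 1.0)
        converged = np.max(np.abs(new_q - q)) < tol
        q = new_q
        if converged:
            break
    return q

def select_decision(q):
    """Return the index of the most excited neuron."""
    return int(np.argmax(q))

if __name__ == "__main__":
    w_plus = [[0.0, 0.2, 0.1], [0.1, 0.0, 0.1], [0.1, 0.2, 0.0]]
    w_minus = [[0.0, 0.05, 0.05], [0.05, 0.0, 0.05], [0.05, 0.05, 0.0]]
    q = rnn_excitation(w_plus, w_minus, [0.1] * 3, [0.1] * 3)
    print(q, select_decision(q))
```

With three neurons indexed 0, 1 and 2, the selected index corresponds to the upwards, downwards or equal trend decision, respectively.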

Deep Reinforcement Learning in Intelligent Infrastructure
The intelligent infrastructure model consists of a layer of sensor neurons that takes infrastructure measurements for specific variables such as temperature or humidity related to a precise area or floor. The $u$ sensor neurons are connected to their respective manager neuron, which averages their values. In addition, a neural management layer that makes predictions about the sensor network values and trends is based on the presented deep reinforcement learning (Figure 9).

The intelligent infrastructure presented in this article uses deep reinforcement learning to make three trend decisions with three associated neurons and an independent neuron that makes predictions of the values of the manager neuron $q_m$:
• $q_0$ predicts that the trend of the manager neuron $q_m$ is upwards;
• $q_1$ predicts that the trend of the manager neuron $q_m$ is downwards;
• $q_2$ predicts that the trend of the manager neuron $q_m$ is to keep its value or be equal;
• $q_{t+1}$ predicts the value of the manager neuron $q_m$.
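The relationship between sensor neurons, the manager neuron and the three trend decisions can be sketched in a few lines. This is an illustrative sketch only: the function names and the `tolerance` parameter are assumptions, and the indices 0, 1 and 2 follow the upwards, downwards and equal convention used for the decision neurons.

```python
from statistics import mean

UPWARDS, DOWNWARDS, EQUAL = 0, 1, 2

def manager_value(sensor_readings):
    """Manager neuron value: the average of its connected sensor neurons."""
    return mean(sensor_readings)

def observed_trend(previous, current, tolerance=0.0):
    """Classify the change of the manager value between two measurements."""
    delta = current - previous
    if delta > tolerance:
        return UPWARDS
    if delta < -tolerance:
        return DOWNWARDS
    return EQUAL

def decision_is_correct(decided_trend, previous, current, tolerance=0.0):
    """True when the trend chosen by the decision neurons matches the observation."""
    return decided_trend == observed_trend(previous, current, tolerance)

if __name__ == "__main__":
    before = manager_value([20.0, 22.0, 21.0])  # three sensors in one zone
    after = manager_value([20.5, 22.5, 21.5])
    print(before, after, observed_trend(before, after))
```

A correct decision would lead to a reward of the network weights and an incorrect one to a penalisation, as described for the learning algorithm.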

Deep Reinforcement Learning in Intelligent Infrastructure: Validation and Results
The intelligent infrastructure deep reinforcement algorithm was validated with a research dataset [27] (https://github.com/iamrishab/Data-driven-prediction-models-of-energy-use-of-appliances-in-a-low-energy-house) based on a house with electric metering with Meter-Bus energy counters that measure the energy consumption of appliances, electric baseboard heaters and lighting (Figure 10). The house temperature and humidity were monitored with a ZigBee wireless sensor network located in nine different zones; in addition, the temperature and humidity of an external weather station were also included. Information was collected every 10 minutes for 137 days (4.5 months) from 11 January 2016 at 17:00 to 27 May 2016 at 18:00, with 19,736 measurements in total, where an entire day was formed of 144 measurements. Key values of the dataset are shown in Table 1. Deep reinforcement learning is an unsupervised learning algorithm, and therefore the validation did not include training iterations.

The intelligent infrastructure (Appendix B) consisted of a configuration of five sensor networks with one neuron per external sensor. There were 22 external sensor neurons in total connected to 5 manager neurons associated with energy consumption, indoor and outdoor temperature and humidity, respectively (Figure 11). Equally, there were five management layers that made trend and value predictions using a Deep Reinforcement Learning (DRL) algorithm based on six different memory configurations, as shown in Table 2.

Tables 3 and 4 show the number of rewards (R) or successes, penalizations (P) or misses and accuracy (A) for different values of the learning gradient $\beta$ across the 19,736 data measurements for the DRL-0M no memory and DRL-FM full memory configurations, respectively, at medium threshold memory ($\alpha = 0.5$). There was a slight increment in accuracy with the introduction of deep reinforcement learning, although the values between different memory configurations were not very different (Figure 12): the learning gradient $\beta$ had an impact on the accuracy, where its optimum value was $1 \times 10^{3}$.

Tables 5 and 6 show the number of rewards (R) or successes, penalizations (P) or misses and accuracy (A) for different values of the threshold memory $\alpha$ across the 19,736 data measurements for the DRL-0M no memory and DRL-FM full memory configurations, respectively, at a medium learning gradient ($\beta = 1 \times 10^{3}$). Equivalent to the previous validation, there was a slight increment in accuracy with the introduction of deep reinforcement learning, although the values between different memory configurations were not very different (Figure 13): the threshold memory $\alpha$ did not have a great impact on the accuracy, although it peaked at a value of 0.25.

Tables 7 and 8 show the root mean square error of the predicted values (Appendix A) against the real measurements for different values of the prediction memory $\gamma$ across the 19,736 data measurements for the DRL-0M no memory and DRL-FM full memory configurations, respectively, at medium threshold memory ($\alpha = 0.5$) and a medium learning gradient ($\beta = 1 \times 10^{3}$).

The addition of deep reinforcement learning increased the accuracy of the trend prediction decision; however, it also unexpectedly increased the error of the predictor neuron. The error decreased when the weighting factors rewarded the actual value rather than the previous prediction (Figure 14).
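The two validation metrics can be sketched as follows. The root mean square error is the standard definition; the accuracy formula A = R / (R + P) is an assumption about how the reported accuracy relates to the counts of rewards and penalizations. The values in the usage example are made up for illustration and are not results from the tables.

```python
import math

def accuracy(rewards, penalizations):
    """Fraction of correct trend decisions, assuming A = R / (R + P)."""
    total = rewards + penalizations
    if total == 0:
        raise ValueError("at least one decision is required")
    return rewards / total

def rmse(predicted, measured):
    """Root mean square error between predicted and measured values."""
    if len(predicted) != len(measured) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))

if __name__ == "__main__":
    # Illustrative values only.
    print(accuracy(rewards=3, penalizations=1))
    print(rmse([21.0, 21.5, 22.0], [21.0, 21.0, 23.0]))
```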

Discussion
This section describes several challenges in applying artificial intelligence to intelligent infrastructure:

(1) The IoT enables greater sensing capabilities with distributed electronic devices that consume very low power and transmit information at very low bandwidth using LoRaWAN, Bluetooth, Wi-Fi or 5G transmission networks. In addition, the IoT has application-specific open protocols such as KNX, Modbus, BACnet/IT or LonWorks. This abundance of protocols and transmission networks will be designed to enable open and interoperable solutions from different manufacturers;

(2) Virtual devices' data between different digital platforms and cloud infrastructure will be standardized with a common naming scheme and relationship mapping that identifies the dependencies between different devices using common normalization structures: ideally, semantic data must capture the ontology and taxonomy between assets. In addition, data obtained from different applications and systems will be normalized in common data structures;

(3) Real physical devices or assets will be tagged with Universally Unique Identifiers (UUIDs) using common asset nomenclature structures based on JSON, Cascading Style Sheets (CSS) or Extensible Markup Language (XML), among others. Asset information and variables will be transmitted to the IoT cloud with standardized transmission protocols such as an MQTT server for low bandwidth applications, Hypertext Transfer Protocol Secure (HTTPS) for reliable communications or CoAP for unreliable asynchronous communications;

(4) The balance between the expandability, availability and cost-effective servers and data hosting provided by the cloud and on-premise edge computing, which enables additional resilience and independence, will be considered in terms of reliability, cybersecurity, cost and functionality. The management of devices administered by the cloud will also be normalized with additional applications that automate their configurations and updates;

(5) The improved interconnection of devices and assets enabled by the IoT also increases cybersecurity risks that will be addressed with firewalls, demilitarized zones (DMZs), proxy servers, data encryption, blockchain, virtualization, microsegmentation and software-defined networks (SDNs);

(6) Although data will be increasingly stored in redundant virtual platforms and will therefore be difficult to remove permanently, human data privacy will also be considered. Data will be encrypted, access to private data will be monitored and authorized, and not all data will be stored: in addition, data will not identify their human generators and owners. A failure to be sensitive and open about data privacy will generate a human reaction against intelligent infrastructure;

(7) Different infrastructure user interfaces will be unified in order to improve user experience (UX), from end users via mobile or web apps to management and operator users via common dashboards with unified single panes;

(8) Human adoption of artificial intelligence, with its applications and innovations, will be gradual and guided to enable a successful coexistence. Although AI and deep machine learning will enable intelligent infrastructure managers to make more abstract decisions as a result of enhanced data correlations that provide tailored and greater insights, from the operational and maintenance perspective they can lead to job redundancies, as tasks can be done autonomously based on learned predictions. A clear example is the application of blockchain to digital ledgers that will enhance or replace the role of bankers, accountants or project managers;

(9) AI will include ethics at every decision stage, enabling humans to override any AI decision to avoid catastrophic situations where AI could extinguish humans due to faulty sensor devices, non-exhaustive learning or intentional cyber attacks;

(10) Finally, the additional digital infrastructure inserted into real infrastructure will increase its economic cost, where returns on investment (ROIs) are normally difficult to evidence or justify. Successful business cases that consider both CAPEX and OPEX will feature intangible benefits, applications or enhanced user experience that remain difficult to quantify from an economic perspective. A clear analogy is the comparison between the current economic benefit and the ROIs of railways built two centuries ago: the quality of life, mobility, business opportunities and user experience we currently benefit from would have been very difficult to justify during their respective feasibility stages.
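As an illustration of challenge (3), the following minimal Python sketch tags an asset with a UUID and serializes one of its variables into a JSON message that could be handed to a transport such as MQTT, HTTPS or CoAP. The field names (`id`, `name`, `type`, `parent`, `asset`, `variable`, `value`, `unit`) and the helper functions are hypothetical and are not taken from the article; the sketch only shows one plausible shape for a common asset nomenclature, not the scheme used by the author.

```python
import json
import uuid


def make_asset_record(name, asset_type, parent_id=None):
    """Build a hypothetical asset record tagged with a random (version 4) UUID."""
    return {
        "id": str(uuid.uuid4()),   # universally unique identifier for the asset
        "name": name,              # human-readable asset name (assumed field)
        "type": asset_type,        # asset class in an assumed taxonomy
        "parent": parent_id,       # optional dependency on another asset
    }


def to_payload(record, variable, value, unit):
    """Serialize one asset variable as a JSON string ready for transmission."""
    message = {
        "asset": record["id"],
        "variable": variable,
        "value": value,
        "unit": unit,
    }
    # sort_keys gives a stable key order, which simplifies comparison and logging
    return json.dumps(message, sort_keys=True)


if __name__ == "__main__":
    building = make_asset_record("Building A", "building")
    ahu = make_asset_record("AHU-01", "air_handling_unit", parent_id=building["id"])
    print(to_payload(ahu, "supply_temperature", 18.5, "degC"))
```

The `parent` field is one simple way to record the dependencies between devices mentioned in challenge (2); a full ontology would need a richer relationship model.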

Conclusions
This research proposed a deep reinforcement learning algorithm embedded in intelligent infrastructure that enables its adaptation to the external environment, learning from its users and monitoring its functionality in terms of assets, space and energy, therefore assisting managers or

Figure 2. The Internet of Things.

Figure 9. Deep reinforcement learning in intelligent infrastructure.

Full memory: learning starts from the beginning of actions, t = 0
DRL-1D: Partial memory: learning covers only the last day, t = l − 1 − 144
DRL-7D: Partial memory: learning covers only the last week, t = l − 1 − 144 × 7
DRL-DD: Partial memory: learning covers the same time for all previous days, t = Δ144
DRL-WW: Partial memory: learning covers the same time for all previous weeks, t = Δ144 × 7
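
The memory configurations listed above can be read as rules for selecting which past samples feed the learning step. The Python sketch below is one possible interpretation, not the author's implementation. It assumes that `l` is the index of the sample being predicted, that a day contains 144 samples (the constant that appears in the list, consistent with one sample every 10 minutes), and that Δ144 means stepping back in strides of 144 samples. The mode name `"full"` and the function `learning_indices` are hypothetical labels introduced here.

```python
SAMPLES_PER_DAY = 144                    # assumption: one sample every 10 minutes
SAMPLES_PER_WEEK = SAMPLES_PER_DAY * 7


def learning_indices(mode, l):
    """Return the past sample indices used for learning before predicting sample l."""
    last = l - 1                         # most recent sample already observed
    if mode == "full":                   # full memory: start at t = 0
        return list(range(0, l))
    if mode == "DRL-1D":                 # last day: start at t = l - 1 - 144
        return list(range(max(0, last - SAMPLES_PER_DAY), l))
    if mode == "DRL-7D":                 # last week: start at t = l - 1 - 144 * 7
        return list(range(max(0, last - SAMPLES_PER_WEEK), l))
    if mode == "DRL-DD":                 # same time of day on all previous days
        return list(range(l - SAMPLES_PER_DAY, -1, -SAMPLES_PER_DAY))[::-1]
    if mode == "DRL-WW":                 # same time of week in all previous weeks
        return list(range(l - SAMPLES_PER_WEEK, -1, -SAMPLES_PER_WEEK))[::-1]
    raise ValueError(f"unknown memory mode: {mode}")


if __name__ == "__main__":
    for mode in ("full", "DRL-1D", "DRL-7D", "DRL-DD", "DRL-WW"):
        indices = learning_indices(mode, 2100)
        print(mode, len(indices), indices[:3])
```

Under this reading, the contiguous windows (DRL-1D, DRL-7D) capture recent dynamics, while the strided windows (DRL-DD, DRL-WW) compare like-for-like times across days or weeks.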
Private Key, From, To, Value)
Verify Signature = Function (Public Key, Signature, From, To, Value)

as the probability of transition from state s_t to state s_{t+1} under action a_t;
• R_a(s_t, s_{t+1}) as the immediate reward r_{t+1} after transition from state s_t to state s_{t+1} under action a_t;
• Rules that describe agent observations.