Search Results (22)

Search Parameters:
Keywords = contextual bandit

17 pages, 1590 KB  
Article
Integrating Contextual Causal Deep Networks and LLM-Guided Policies for Sequential Decision-Making
by Jong-Min Kim
Mathematics 2026, 14(2), 269; https://doi.org/10.3390/math14020269 - 10 Jan 2026
Viewed by 185
Abstract
Sequential decision-making is critical for applications ranging from personalized recommendations to resource allocation. This study evaluates three decision policies—Greedy, Thompson Sampling (via Monte Carlo Dropout), and a zero-shot Large Language Model (LLM)-guided policy (Gemini-1.5-Pro)—within a contextual bandit framework. To address covariate shift and assess subpopulation performance, we utilize a Collective Conditional Diffusion Network (CCDN) where covariates are partitioned into B=10 homogeneous blocks. Evaluating these policies across a high-dimensional treatment space (K=5, resulting in 2^5 = 32 actions), we tested performance in a simulated environment and three benchmark datasets: Boston Housing, Wine Quality, and Adult Income. Our results demonstrate that the Greedy strategy achieves the highest Model-Relative Optimal (MRO) coverage, reaching 1.00 in the Wine Quality and Adult Income datasets, though performance drops significantly to 0.05 in the Boston Housing environment. Thompson Sampling maintains competitive regret and, in the Boston Housing dataset, marginally outperforms Greedy in action selection precision. Conversely, the zero-shot LLM-guided policy consistently underperforms in numerical tabular settings, exhibiting the highest median regret and near-zero MRO coverage across most tasks. Furthermore, Wilcoxon tests reveal that differences in empirical outcomes between policies are often not statistically significant (ns), suggesting an optimization ceiling in zero-shot tabular settings. These findings indicate that while traditional model-driven policies are robust, LLM-guided approaches currently lack the numerical precision required for high-dimensional sequential decision-making without further calibration or hybrid integration.
(This article belongs to the Special Issue Computational Methods and Machine Learning for Causal Inference)
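As a concrete illustration of the Thompson Sampling via Monte Carlo Dropout policy the abstract describes, here is a minimal sketch assuming a toy two-layer reward network; the weights, sizes, and names (sample_rewards, dropout_p) are illustrative stand-ins, not the paper's implementation:

```python
# Minimal sketch of Thompson Sampling via Monte Carlo Dropout in a
# contextual bandit, assuming a toy 2-layer reward network shared across
# actions. All sizes and names are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
n_actions, d, hidden, dropout_p = 32, 8, 64, 0.1   # 2**5 = 32 treatment combos

# Random-feature reward heads (stand-in for a trained network).
W1 = rng.normal(size=(hidden, d))
W2 = rng.normal(size=(n_actions, hidden)) / np.sqrt(hidden)

def sample_rewards(x):
    """One stochastic forward pass: dropout stays ON at decision time, so
    each call behaves like one posterior sample of per-action rewards."""
    h = np.maximum(W1 @ x, 0.0)                     # ReLU features
    mask = rng.random(hidden) > dropout_p           # Bernoulli dropout mask
    h = h * mask / (1.0 - dropout_p)                # inverted-dropout scaling
    return W2 @ h

x = rng.normal(size=d)                              # observed context
action = int(np.argmax(sample_rewards(x)))          # Thompson step: act greedily
print("chosen action:", action)                     # w.r.t. ONE dropout sample
```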

19 pages, 2585 KB  
Article
SYMPHONY: Synergistic Hierarchical Metric-Fusion and Predictive Hybrid Optimization for Network Yield—A VANET Routing Protocol
by Abdul Karim Kazi, Muhammad Imran, Raheela Asif and Saman Hina
Sensors 2026, 26(1), 135; https://doi.org/10.3390/s26010135 - 25 Dec 2025
Viewed by 402
Abstract
Vehicular ad hoc networks (VANETs) must simultaneously satisfy stringent reliability, latency, and sustainability targets under highly dynamic urban and highway mobility. Existing solutions typically optimise one or two dimensions (link stability, clustering, or energy) but lack an integrated, adaptive mechanism that fuses heterogeneous metrics while remaining lightweight and deployable. This paper introduces a VANET routing protocol named SYMPHONY (Synergistic Hierarchical Metric-Fusion and Predictive Hybrid Optimization for Network Yield) that operates in three coordinated layers: (i) a compact neighbourhood filtering stage that reduces forwarding scope and eliminates transient relays, (ii) a cluster layer that elects resilient cluster heads using fuzzy energy-aware metrics and backup leadership, and (iii) a global inter-cluster optimizer that blends a GA-reseeded swarm metaheuristic with a stability-aware pheromone scheme to produce multi-objective routes. Crucially, SYMPHONY employs an ultra-lightweight online weight-adaptation module (contextual linear bandit) to tune metric fusion weights in response to observed rewards (packet delivery ratio, end-to-end delay, and Green Performance Index). We evaluated the proposed routing protocol SYMPHONY versus strong modern baselines across urban and highway scenarios with varying density and resource constraints. The results demonstrate that SYMPHONY improves packet delivery ratio by up to 12–18%, reduces latency by 20–35%, and increases the Green Performance Index by 22–45% relative to the best baseline, while keeping control overhead and per-node computation within practical bounds.
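The weight-adaptation module is described only at a high level; the LinUCB-style sketch below shows that idea, where arms are candidate metric-fusion weight vectors and the context is the observed network state. The candidate weights, context features, and reward value are illustrative assumptions:

```python
# Minimal LinUCB sketch of online metric-fusion weight adaptation: arms are
# candidate (PDR, delay, GPI) weight vectors; the context is the network
# state. Candidate weights and features are illustrative, not the paper's.
import numpy as np

candidate_weights = np.array([
    [0.6, 0.2, 0.2],                # favour packet delivery ratio
    [0.2, 0.6, 0.2],                # favour low end-to-end delay
    [0.2, 0.2, 0.6],                # favour the Green Performance Index
])
d, alpha = 4, 1.0                   # context dimension, exploration strength
A = [np.eye(d) for _ in candidate_weights]   # per-arm Gram matrices
b = [np.zeros(d) for _ in candidate_weights] # per-arm reward accumulators

def select(x):
    """LinUCB: pick the arm with the highest optimistic reward estimate."""
    scores = []
    for Ai, bi in zip(A, b):
        theta = np.linalg.solve(Ai, bi)
        scores.append(theta @ x + alpha * np.sqrt(x @ np.linalg.solve(Ai, x)))
    return int(np.argmax(scores))

def update(arm, x, reward):
    A[arm] += np.outer(x, x)
    b[arm] += reward * x

x = np.array([0.8, 0.3, 0.5, 1.0])  # e.g. density, speed, load, bias term
arm = select(x)                     # fusion weights to use this round
# ... route with candidate_weights[arm], observe the fused reward ...
update(arm, x, reward=0.7)          # 0.7 is a stand-in observed reward
```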

18 pages, 484 KB  
Article
LLM-Guided Ensemble Learning for Contextual Bandits with Copula and Gaussian Process Models
by Jong-Min Kim
Mathematics 2025, 13(15), 2523; https://doi.org/10.3390/math13152523 - 6 Aug 2025
Cited by 1 | Viewed by 2863
Abstract
Contextual multi-armed bandits (CMABs) are vital for sequential decision-making in areas such as recommendation systems, clinical trials, and finance. We propose a simulation framework integrating Gaussian Process (GP)-based CMABs with vine copulas to model dependent contexts and GARCH processes to capture reward volatility. Rewards are generated via copula-transformed Beta distributions to reflect complex joint dependencies and skewness. We evaluate four policies—ensemble, Epsilon-greedy, Thompson, and Upper Confidence Bound (UCB)—over 10,000 replications, assessing cumulative regret, observed reward, and cumulative reward. While Thompson sampling and LLM-guided policies consistently minimize regret and maximize rewards under varied reward distributions, Epsilon-greedy shows instability, and UCB exhibits moderate performance. Enhancing the ensemble with copula features, GP models, and dynamic policy selection driven by a large language model (LLM) yields superior adaptability and performance. Our results highlight the effectiveness of combining structured probabilistic models with LLM-based guidance for robust, adaptive decision-making in skewed, high-variance environments.
(This article belongs to the Special Issue Privacy-Preserving Machine Learning in Large Language Models (LLMs))
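The abstract does not detail the dynamic policy selection; as a hedged sketch of the general control flow, the snippet below treats selection among the base policies as a bandit-over-policies using EXP3, with a stand-in reward in place of the paper's LLM-guided mechanism:

```python
# Minimal sketch of ensemble policy selection as a bandit-over-policies.
# The paper drives the selection with an LLM; a standard EXP3 meta-policy
# stands in here purely to show the control flow.
import numpy as np

rng = np.random.default_rng(1)
policies = ["epsilon_greedy", "thompson", "ucb"]
K, gamma = len(policies), 0.1
log_w = np.zeros(K)                              # meta-policy log-weights

def meta_probs():
    w = np.exp(log_w - log_w.max())
    return (1 - gamma) * w / w.sum() + gamma / K # mix in uniform exploration

for t in range(200):
    p = meta_probs()
    k = rng.choice(K, p=p)                       # pick a base policy
    reward = float(rng.beta(2 + k, 2))           # stand-in for that policy's
    log_w[k] += gamma * reward / (K * p[k])      # observed reward; EXP3 update

print({name: round(float(prob), 3) for name, prob in zip(policies, meta_probs())})
```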

16 pages, 2246 KB  
Article
Context-Aware Beam Selection for IRS-Assisted mmWave V2I Communications
by Ricardo Suarez del Valle, Abdulkadir Kose and Haeyoung Lee
Sensors 2025, 25(13), 3924; https://doi.org/10.3390/s25133924 - 24 Jun 2025
Cited by 1 | Viewed by 1105
Abstract
Millimeter wave (mmWave) technology, with its ultra-high bandwidth and low latency, holds significant promise for vehicle-to-everything (V2X) communications. However, it faces challenges such as high propagation losses and limited coverage in dense urban vehicular environments. Intelligent Reflecting Surfaces (IRSs) help address these issues by enhancing mmWave signal paths around obstacles, thereby maintaining reliable communication. This paper introduces a novel Contextual Multi-Armed Bandit (C-MAB) algorithm designed to dynamically adapt beam and IRS selections based on real-time environmental context. Simulation results demonstrate that the proposed C-MAB approach significantly improves link stability, doubling average beam sojourn times compared to traditional SNR-based strategies and standard MAB methods, and achieving gains of up to four times the performance in scenarios with IRS assistance. This approach enables optimized resource allocation and significantly improves coverage, data rate, and resource utilization compared to conventional methods.
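A minimal sketch of a contextual MAB beam selector in the spirit of the abstract: the context (here a discretized vehicle-position zone) indexes an independent UCB1 instance over candidate beam/IRS pairs. Zone and arm counts, and the simulated link reward, are illustrative:

```python
# Minimal sketch of contextual MAB beam selection: each context cell keeps
# its own UCB1 statistics over candidate beam/IRS pairs. Sizes and the
# simulated reward model are illustrative, not the paper's setup.
import numpy as np

n_zones, n_arms = 6, 8                       # context cells x (beam, IRS) pairs
counts = np.ones((n_zones, n_arms))          # pull counts (init 1 to avoid /0)
means  = np.zeros((n_zones, n_arms))         # running mean link reward

def select(zone, t):
    ucb = means[zone] + np.sqrt(2 * np.log(t + 1) / counts[zone])
    return int(np.argmax(ucb))

def update(zone, arm, reward):
    counts[zone, arm] += 1
    means[zone, arm] += (reward - means[zone, arm]) / counts[zone, arm]

rng = np.random.default_rng(2)
for t in range(1000):
    zone = rng.integers(n_zones)             # observed vehicle context
    arm = select(zone, t)
    update(zone, arm, rng.normal(0.5 + 0.05 * arm, 0.1))  # simulated SNR reward
```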

18 pages, 803 KB  
Article
Gaussian Process with Vine Copula-Based Context Modeling for Contextual Multi-Armed Bandits
by Jong-Min Kim
Mathematics 2025, 13(13), 2058; https://doi.org/10.3390/math13132058 - 21 Jun 2025
Cited by 1 | Viewed by 1182
Abstract
We propose a novel contextual multi-armed bandit (CMAB) framework that integrates copula-based context generation with Gaussian Process (GP) regression for reward modeling, addressing complex dependency structures and uncertainty in sequential decision-making. Context vectors are generated using Gaussian and vine copulas to capture nonlinear dependencies, while arm-specific reward functions are modeled via GP regression with Beta-distributed targets. We evaluate three widely used bandit policies—Thompson Sampling (TS), ε-Greedy, and Upper Confidence Bound (UCB)—on simulated environments informed by real-world datasets, including Boston Housing and Wine Quality. The Boston Housing dataset exemplifies heterogeneous decision boundaries relevant to housing-related marketing, while the Wine Quality dataset introduces sensory feature-based arm differentiation. Our empirical results indicate that the ε-Greedy policy consistently achieves the highest cumulative reward and lowest regret across multiple runs, outperforming both GP-based TS and UCB in high-dimensional, copula-structured contexts. These findings suggest that combining copula theory with GP modeling provides a robust and flexible foundation for data-driven sequential experimentation in domains characterized by complex contextual dependencies.
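A minimal sketch of the ε-greedy policy over per-arm GP reward models, assuming scikit-learn's GaussianProcessRegressor is available; contexts here are plain Gaussian draws rather than the paper's copula-generated vectors, and refitting the GPs every round is for clarity, not efficiency:

```python
# Minimal sketch of epsilon-greedy over per-arm GP reward models. Contexts
# and rewards are simplified stand-ins for the paper's copula/Beta setup.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(3)
n_arms, d, eps = 3, 4, 0.1
X_hist = [[] for _ in range(n_arms)]    # contexts observed per arm
y_hist = [[] for _ in range(n_arms)]    # rewards observed per arm

def choose(x):
    if rng.random() < eps or any(len(y) < 2 for y in y_hist):
        return int(rng.integers(n_arms))            # explore / warm-up
    preds = [GaussianProcessRegressor().fit(X_hist[a], y_hist[a]).predict([x])[0]
             for a in range(n_arms)]                # refit per step for clarity
    return int(np.argmax(preds))                    # exploit GP mean estimates

for t in range(30):
    x = rng.normal(size=d)                          # stand-in for copula context
    a = choose(x)
    r = float(rng.beta(2 + a, 2))                   # stand-in Beta reward
    X_hist[a].append(x); y_hist[a].append(r)
```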

25 pages, 1976 KB  
Article
Balancing Efficiency and Efficacy: A Contextual Bandit-Driven Framework for Multi-Tier Cyber Threat Detection
by Ibrahim Mutambik and Abdullah Almuqrin
Appl. Sci. 2025, 15(11), 6362; https://doi.org/10.3390/app15116362 - 5 Jun 2025
Cited by 3 | Viewed by 3014
Abstract
In response to the rising volume and sophistication of cyber intrusions, data-oriented methods have emerged as critical defensive measures. While machine learning—including neural network-based solutions—has demonstrated strong capabilities in identifying malicious activities, several fundamental challenges remain. Chief among these difficulties are the substantial resource demands related to data preprocessing and inference procedures, limited scalability beyond centralized environments, and the necessity of deploying multiple specialized detection models to address diverse stages of the cyber kill chain. This paper introduces a contextual bandit-based reinforcement learning approach, designed to reduce operational expenditures and enhance detection cost-efficiency by introducing an adaptive decision boundary within a layered detection scheme. The proposed framework continually measures the confidence of each participating detection model, applying a reward-driven mechanism to balance cost and accuracy. Specifically, each potential action, representing a particular decision boundary, earns a reward reflecting its overall cost-to-effectiveness ratio, thereby prioritizing reduced overheads. We validated our method using two highly representative datasets that capture prevalent modern-day threats: phishing and malware. Our findings show that this contextual bandit-based strategy adeptly regulates the frequency of resource-intensive detection tasks, significantly lowering both inference and processing expenses. Remarkably, it achieves this reduction with minimal compromise to overall detection accuracy and efficacy.
(This article belongs to the Special Issue Advances in Internet of Things (IoT) Technologies and Cybersecurity)
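A minimal sketch of the adaptive decision boundary idea: arms are candidate confidence thresholds for escalating a sample to the expensive detector, and the reward trades simulated accuracy against inference cost. The threshold grid, cost weight lam, and simulator are illustrative assumptions:

```python
# Minimal sketch of reward-driven boundary selection: a UCB1 bandit over
# candidate confidence thresholds, rewarded by accuracy minus weighted cost.
# The threshold grid, cost weight, and simulator are illustrative.
import numpy as np

rng = np.random.default_rng(4)
thresholds = np.array([0.5, 0.6, 0.7, 0.8, 0.9])   # candidate boundaries (arms)
lam = 0.3                                          # cost weight in the reward
counts = np.ones(len(thresholds))
means = np.zeros(len(thresholds))

for t in range(2000):
    ucb = means + np.sqrt(2 * np.log(t + 1) / counts)
    a = int(np.argmax(ucb))
    conf = rng.random()                            # cheap model's confidence
    escalate = conf < thresholds[a]                # low confidence -> deep model
    acc = 0.95 if escalate else 0.7 + 0.3 * conf   # simulated detection accuracy
    cost = 1.0 if escalate else 0.1                # simulated inference cost
    reward = acc - lam * cost                      # cost-to-effectiveness reward
    counts[a] += 1
    means[a] += (reward - means[a]) / counts[a]

print("selected boundary:", thresholds[int(np.argmax(means))])
```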

20 pages, 896 KB  
Article
MAB-Based Online Client Scheduling for Decentralized Federated Learning in the IoT
by Zhenning Chen, Xinyu Zhang, Siyang Wang and Youren Wang
Entropy 2025, 27(4), 439; https://doi.org/10.3390/e27040439 - 18 Apr 2025
Viewed by 895
Abstract
Different from conventional federated learning (FL), which relies on a central server for model aggregation, decentralized FL (DFL) exchanges models among edge servers, thus improving robustness and scalability. When deploying DFL in the Internet of Things (IoT), limited wireless resources cannot provide simultaneous access to massive numbers of devices, so client scheduling must be performed to balance the convergence rate and model accuracy. However, the heterogeneity of computing and communication resources across client devices, combined with the time-varying nature of wireless channels, makes it challenging to accurately estimate the delay associated with client participation during scheduling. To address this issue, we investigate the client scheduling and resource optimization problem in DFL without prior client information. Specifically, the problem is reformulated as a multi-armed bandit (MAB) program, and an online learning algorithm that uses contextual multi-armed bandits for client delay estimation and scheduling is proposed. Theoretical analysis shows that the algorithm achieves asymptotically optimal performance. The experimental results show that the algorithm makes asymptotically optimal client selection decisions and outperforms existing algorithms in reducing the cumulative delay of the system.
(This article belongs to the Section Information Theory, Probability and Statistics)
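A minimal sketch of bandit-based client scheduling with unknown delays, in the spirit of the abstract: each client is an arm, the observed round delay is the cost, and a lower-confidence-bound rule favours clients likely to be fast. The delay model and client count are illustrative:

```python
# Minimal sketch of bandit-based client scheduling: optimism applied to a
# MINIMIZATION objective (delay), so we pick the arm with the lowest
# lower confidence bound. Delay distributions are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(5)
n_clients = 10
true_delay = rng.uniform(0.2, 1.0, n_clients)    # unknown mean round delays
counts = np.ones(n_clients)
mean_delay = np.zeros(n_clients)                 # optimistic (low) initial guess

for t in range(500):
    lcb = mean_delay - np.sqrt(2 * np.log(t + 1) / counts)
    k = int(np.argmin(lcb))                      # schedule the optimistic client
    d = max(float(true_delay[k] + 0.1 * rng.standard_normal()), 0.0)
    counts[k] += 1
    mean_delay[k] += (d - mean_delay[k]) / counts[k]

print("most-scheduled client:", int(np.argmax(counts)),
      "| true fastest client:", int(np.argmin(true_delay)))
```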

53 pages, 1295 KB  
Review
Selective Reviews of Bandit Problems in AI via a Statistical View
by Pengjie Zhou, Haoyu Wei and Huiming Zhang
Mathematics 2025, 13(4), 665; https://doi.org/10.3390/math13040665 - 18 Feb 2025
Cited by 9 | Viewed by 2542
Abstract
Reinforcement Learning (RL) is a widely researched area in artificial intelligence that focuses on teaching agents decision-making through interactions with their environment. A key subset includes multi-armed bandit (MAB) and stochastic continuum-armed bandit (SCAB) problems, which model sequential decision-making under uncertainty. This review outlines the foundational models and assumptions of bandit problems, explores non-asymptotic theoretical tools like concentration inequalities and minimax regret bounds, and compares frequentist and Bayesian algorithms for managing exploration–exploitation trade-offs. Additionally, we explore K-armed contextual bandits and SCAB, focusing on their methodologies and regret analyses. We also examine the connections between SCAB problems and functional data analysis. Finally, we highlight recent advances and ongoing challenges in the field.
(This article belongs to the Special Issue Advances in Statistical AI and Causal Inference)
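For readers new to the area, the concentration-based indices and minimax rates this review surveys take the following standard forms (textbook results, stated here for orientation rather than reproduced from the paper):

```latex
% UCB index derived from Hoeffding's concentration inequality, and the
% K-armed minimax regret rate that the review's bounds revolve around.
\[
  \mathrm{UCB}_a(t) \;=\; \hat{\mu}_a(t) + \sqrt{\frac{2\log t}{N_a(t)}},
  \qquad
  \inf_{\pi}\,\sup_{\nu}\; \mathbb{E}\,R_T(\pi,\nu) \;=\; \Theta\!\left(\sqrt{KT}\right),
\]
% where $\hat{\mu}_a(t)$ is the empirical mean reward of arm $a$, $N_a(t)$
% its pull count, $K$ the number of arms, and $T$ the horizon.
```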

17 pages, 2767 KB  
Article
Adaptive Noise Exploration for Neural Contextual Multi-Armed Bandits
by Chi Wang, Lin Shi and Junru Luo
Algorithms 2025, 18(2), 56; https://doi.org/10.3390/a18020056 - 21 Jan 2025
Viewed by 2165
Abstract
In contextual multi-armed bandits, the relationship between contextual information and rewards is typically unknown, complicating the trade-off between exploration and exploitation. A common approach to address this challenge is the Upper Confidence Bound (UCB) method, which constructs confidence intervals to guide exploration. However, the UCB method becomes computationally expensive in environments with numerous arms and dynamic contexts. This paper presents an adaptive noise exploration framework to reduce computational complexity and introduces two novel algorithms: EAD (Exploring Adaptive Noise in Decision-Making Processes) and EAP (Exploring Adaptive Noise in Parameter Spaces). EAD injects adaptive noise into the reward signals based on arm selection frequency, while EAP adds adaptive noise to the hidden layer of the neural network for more stable exploration. Experimental results on recommendation and classification tasks show that both algorithms significantly surpass traditional linear and neural methods in computational efficiency and overall performance.
(This article belongs to the Section Algorithms for Multidisciplinary Applications)
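A minimal sketch of frequency-scaled noise exploration in the spirit of EAD: predicted rewards are perturbed with noise that shrinks for frequently chosen arms, replacing an explicit UCB confidence term. The linear reward model and noise schedule are illustrative, not the paper's neural network:

```python
# Minimal sketch of frequency-scaled reward-noise exploration: noise added
# to predicted rewards decays as 1/sqrt(pull count), so rarely chosen arms
# keep larger exploration noise. The linear model is an illustrative stand-in.
import numpy as np

rng = np.random.default_rng(6)
n_arms, d = 5, 8
theta = rng.normal(size=(n_arms, d))     # stand-in learned reward model
counts = np.ones(n_arms)

def choose(x):
    est = theta @ x                                        # predicted rewards
    noise = rng.standard_normal(n_arms) / np.sqrt(counts)  # adaptive noise scale
    return int(np.argmax(est + noise))                     # noisy greedy choice

for t in range(1000):
    x = rng.normal(size=d)
    a = choose(x)
    counts[a] += 1           # frequently pulled arms get quieter over time
```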

20 pages, 351 KB  
Article
Multilevel Constrained Bandits: A Hierarchical Upper Confidence Bound Approach with Safety Guarantees
by Ali Baheri
Mathematics 2025, 13(1), 149; https://doi.org/10.3390/math13010149 - 3 Jan 2025
Cited by 6 | Viewed by 6079
Abstract
The multi-armed bandit (MAB) problem is a foundational model for sequential decision-making under uncertainty. While MAB has proven valuable in applications such as clinical trials and online advertising, traditional formulations have limitations; specifically, they struggle to handle three key real-world scenarios: (1) when decisions must follow a hierarchical structure (as in autonomous systems where high-level strategy guides low-level actions); (2) when there are constraints at multiple levels of decision-making (such as both system-wide and component-level resource limits); and (3) when available actions depend on previous choices or context. To address these challenges, we introduce the hierarchical constrained bandits (HCB) framework, which extends contextual bandits to incorporate both hierarchical decisions and multilevel constraints. We propose the HC-UCB (hierarchical constrained upper confidence bound) algorithm to solve the HCB problem. The algorithm uses confidence bounds within a hierarchical setting to balance exploration and exploitation while respecting constraints at all levels. Our theoretical analysis establishes that HC-UCB achieves sublinear regret, guarantees constraint satisfaction at all hierarchical levels, and is near-optimal in terms of achievable performance. Simple experimental results demonstrate the algorithm’s effectiveness in balancing reward maximization with constraint satisfaction.
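A minimal sketch of hierarchical UCB selection with a constraint filter, loosely following the HCB structure: a high-level UCB picks a strategy, and a low-level UCB then picks among that strategy's actions whose estimated cost respects a budget. Sizes, budget, and simulators are illustrative, and none of the paper's formal guarantees are claimed:

```python
# Minimal sketch of two-level UCB with a constraint filter. High level picks
# a strategy; low level picks a feasible action under that strategy. All
# sizes, the budget, and the reward/cost simulators are illustrative.
import numpy as np

rng = np.random.default_rng(7)
n_hi, n_lo, budget = 3, 4, 0.6
cnt_hi, mu_hi = np.ones(n_hi), np.zeros(n_hi)
cnt_lo, mu_lo = np.ones((n_hi, n_lo)), np.zeros((n_hi, n_lo))
cost_lo = np.full((n_hi, n_lo), 0.3)               # estimated per-action cost

def ucb(mu, cnt, t):
    return mu + np.sqrt(2 * np.log(t + 1) / cnt)

for t in range(1000):
    h = int(np.argmax(ucb(mu_hi, cnt_hi, t)))      # high-level decision
    ok = cost_lo[h] <= budget                      # constraint-satisfying set
    scores = np.where(ok, ucb(mu_lo[h], cnt_lo[h], t), -np.inf)
    a = int(np.argmax(scores))                     # low-level decision
    r = rng.normal(0.4 + 0.1 * h + 0.05 * a, 0.1)  # simulated reward
    c = float(np.clip(rng.normal(0.3, 0.05), 0, 1))  # simulated cost
    cnt_hi[h] += 1; mu_hi[h] += (r - mu_hi[h]) / cnt_hi[h]
    cnt_lo[h, a] += 1; mu_lo[h, a] += (r - mu_lo[h, a]) / cnt_lo[h, a]
    cost_lo[h, a] += (c - cost_lo[h, a]) / cnt_lo[h, a]
```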

16 pages, 2172 KB  
Article
Time-Varying Preference Bandits for Robot Behavior Personalization
by Chanwoo Kim, Joonhyeok Lee, Eunwoo Kim and Kyungjae Lee
Appl. Sci. 2024, 14(23), 11002; https://doi.org/10.3390/app142311002 - 26 Nov 2024
Viewed by 1470
Abstract
Robots are increasingly employed in diverse services, from room cleaning to coffee preparation, necessitating an accurate understanding of user preferences. Traditional preference-based learning allows robots to learn these preferences through iterative queries about desired behaviors. However, these methods typically assume static human preferences. In this paper, we challenge this static assumption by considering the dynamic nature of human preferences and introduce the discounted preference bandit method to manage these changes. This algorithm adapts to evolving human preferences and supports seamless human–robot interaction through effective query selection. Our approach outperforms existing methods in time-varying scenarios across three key performance metrics.
(This article belongs to the Special Issue Advanced Control Systems and Applications)
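A minimal sketch of the discounting idea for time-varying preferences: pairwise win statistics decay geometrically each round so that recent feedback dominates. The query rule (current leader versus a random challenger) and the drifting preference simulator are illustrative assumptions:

```python
# Minimal sketch of a discounted preference (dueling) bandit: pairwise win
# counts decay each round so the estimate tracks drifting preferences.
# The query rule and preference simulator are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(8)
n_behaviors, gamma = 4, 0.98                 # arms and discount factor
wins = np.ones((n_behaviors, n_behaviors))   # discounted pairwise win counts

for t in range(2000):
    pref = np.roll(np.array([0.8, 0.6, 0.4, 0.2]), t // 700)  # drifting taste
    scores = wins.sum(axis=1) / (wins + wins.T).sum(axis=1)   # Borda-like score
    i = int(np.argmax(scores))                                # current leader
    j = int(rng.choice([k for k in range(n_behaviors) if k != i]))
    wins *= gamma                                             # forget the past
    p_i = pref[i] / (pref[i] + pref[j])       # user prefers i with prob p_i
    if rng.random() < p_i:
        wins[i, j] += 1
    else:
        wins[j, i] += 1
```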

38 pages, 1053 KB  
Article
Thompson Sampling for Stochastic Bandits with Noisy Contexts: An Information-Theoretic Regret Analysis
by Sharu Theresa Jose and Shana Moothedath
Entropy 2024, 26(7), 606; https://doi.org/10.3390/e26070606 - 17 Jul 2024
Cited by 3 | Viewed by 1961
Abstract
We study stochastic linear contextual bandits (CB) where the agent observes a noisy version of the true context through a noise channel with unknown channel parameters. Our objective is to design an action policy that can “approximate” that of a Bayesian oracle that has access to the reward model and the noise channel parameter. We introduce a modified Thompson sampling algorithm and analyze its Bayesian cumulative regret with respect to the oracle action policy via information-theoretic tools. For Gaussian bandits with Gaussian context noise, our information-theoretic analysis shows that under certain conditions on the prior variance, the Bayesian cumulative regret scales as Õ(m√T), where m is the dimension of the feature vector and T is the time horizon. We also consider the problem setting where the agent observes the true context with some delay after receiving the reward, and show that delayed true contexts lead to lower regret. Finally, we empirically demonstrate the performance of the proposed algorithms against baselines.
(This article belongs to the Special Issue Information Theoretic Learning with Its Applications)
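A hedged sketch of Thompson sampling under context noise: the agent sees only a noisy context, so it scores each posterior-sampled linear model against an average over plausible true contexts. The Gaussian priors, known noise scale, and denoising-by-averaging step are simplifications, not the paper's algorithm:

```python
# Minimal sketch of Thompson sampling with noisy contexts. The agent never
# sees the true context x, only x_hat; it averages over plausible contexts
# before scoring posterior samples. All priors and scales are illustrative.
import numpy as np

rng = np.random.default_rng(9)
m, n_arms, sigma_ctx = 3, 4, 0.3
theta_true = rng.normal(size=(n_arms, m))        # hidden reward parameters

# Conjugate Gaussian posterior per arm (identity prior, unit reward noise).
A = [np.eye(m) for _ in range(n_arms)]
b = [np.zeros(m) for _ in range(n_arms)]

for t in range(500):
    x = rng.normal(size=m)                       # true context (never observed)
    x_hat = x + sigma_ctx * rng.standard_normal(m)   # noisy observed context
    # Average over plausible true contexts given x_hat (crude denoising).
    x_bar = (x_hat + sigma_ctx * rng.standard_normal((20, m))).mean(axis=0)
    scores = np.zeros(n_arms)
    for a in range(n_arms):
        cov = np.linalg.inv(A[a])
        theta = rng.multivariate_normal(cov @ b[a], cov)  # posterior sample
        scores[a] = theta @ x_bar
    k = int(np.argmax(scores))                   # Thompson choice
    r = theta_true[k] @ x + 0.1 * rng.standard_normal()  # reward uses TRUE x
    A[k] += np.outer(x_hat, x_hat)               # update with what was observed
    b[k] += r * x_hat
```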

18 pages, 1434 KB  
Article
Dynamic Grouping within Minimax Optimal Strategy for Stochastic Multi-Armed Bandits in Reinforcement Learning Recommendation
by Jiamei Feng, Junlong Zhu, Xuhui Zhao and Zhihang Ji
Appl. Sci. 2024, 14(8), 3441; https://doi.org/10.3390/app14083441 - 18 Apr 2024
Cited by 1 | Viewed by 1738
Abstract
The multi-armed bandit (MAB) problem is a typical problem of exploration and exploitation. As a classical MAB problem, the stochastic multi-armed bandit (SMAB) is the basis of reinforcement learning recommendation. However, most existing SMAB and MAB algorithms have two limitations: (1) they do not make full use of feedback from the environment or agent, such as the number of arms and the rewards contained in user feedback; (2) they overlook the utilization of different action selections, which can affect the exploration and exploitation of the algorithm. These limitations motivate us to propose a novel dynamic grouping within the minimax optimal strategy in the stochastic case (DG-MOSS) algorithm for reinforcement learning recommendation in small and medium-sized data scenarios. DG-MOSS does not require additional contextual data and can be used to recommend various types of data. Specifically, we designed a new exploration calculation method based on dynamic grouping that automatically uses feedback information in the selection process and adopts different action selections. To train the algorithm thoroughly, we designed an adaptive episode length that effectively improves training efficiency. We also analyzed and proved the upper bound of DG-MOSS’s regret. Experimental results on datasets of different scales, densities, and fields show that, once sufficiently trained, DG-MOSS yields greater rewards than nine baselines and demonstrates better robustness.
(This article belongs to the Section Computing and Artificial Intelligence)
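A minimal sketch of the MOSS index that DG-MOSS builds on: relative to UCB1, the exploration bonus uses log(T/(K·n_a)) clipped at zero. Horizon, arm count, and the Bernoulli reward simulator are illustrative:

```python
# Minimal sketch of the MOSS index (the base strategy DG-MOSS extends):
# bonus sqrt(max(log(T/(K*n_a)), 0)/n_a) instead of UCB1's sqrt(2 log t/n_a).
# Horizon, arm count, and reward probabilities are illustrative.
import numpy as np

rng = np.random.default_rng(10)
K, T = 5, 5000
counts = np.ones(K)
means = np.zeros(K)

for t in range(T):
    bonus = np.sqrt(np.maximum(np.log(T / (K * counts)), 0.0) / counts)
    a = int(np.argmax(means + bonus))            # MOSS index selection
    r = rng.binomial(1, 0.3 + 0.1 * a)           # stand-in Bernoulli reward
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]

print("best arm estimate:", int(np.argmax(means)))
```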

24 pages, 5426 KB  
Article
Managing Considerable Distributed Resources for Demand Response: A Resource Selection Strategy Based on Contextual Bandit
by Zhaoyu Li and Qian Ai
Electronics 2023, 12(13), 2783; https://doi.org/10.3390/electronics12132783 - 23 Jun 2023
Cited by 3 | Viewed by 2307
Abstract
The widespread adoption of distributed energy resources (DERs) leads to resource redundancy in grid operation and increases computation complexity, which underscores the need for effective resource management strategies. In this paper, we present a novel resource management approach that decouples the resource selection and power dispatch tasks. The resource selection task determines the subset of resources designated to participate in the demand response service, while the power dispatch task determines the power output of the selected candidates. A solution strategy based on contextual bandit with DQN structure is then proposed. Concretely, an agent determines the resource selection action, while the power dispatch task is solved in the environment. The negative value of the operational cost is used as feedback to the agent, which links the two tasks in a closed-loop manner. Moreover, to cope with the uncertainty in the power dispatch problem, distributionally robust optimization (DRO) is applied for the reserve settlement to satisfy the reliability requirement against this uncertainty. Numerical studies demonstrate that the DQN-based contextual bandit approach can achieve a profit enhancement ranging from 0.35% to 46.46% compared to the contextual bandit with policy gradient approach under different resource selection quantities.
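A minimal sketch of the closed loop the abstract describes: the agent picks a resource-selection action, the environment "solves" dispatch, and the negative operational cost returns as the bandit reward. A linear Q-function stands in for the paper's DQN, and dispatch_cost is a made-up placeholder:

```python
# Minimal sketch of a contextual bandit with a Q-function: choose a resource
# subset, let the environment solve dispatch, feed back -cost as reward.
# A linear Q replaces the paper's DQN; dispatch_cost is a placeholder.
import numpy as np

rng = np.random.default_rng(11)
n_actions, d, eps, lr = 8, 6, 0.1, 0.05    # actions = candidate resource subsets
Q = np.zeros((n_actions, d))               # linear Q(s, a) = Q[a] @ s

def dispatch_cost(state, action):
    """Stand-in for solving the power-dispatch problem for the chosen subset."""
    return float(np.abs(state).sum() * (1.0 + 0.1 * action) - action)

for t in range(2000):
    s = rng.normal(size=d)                           # grid operating context
    a = int(rng.integers(n_actions)) if rng.random() < eps \
        else int(np.argmax(Q @ s))                   # epsilon-greedy selection
    r = -dispatch_cost(s, a)                         # negative cost as reward
    Q[a] += lr * (r - Q[a] @ s) * s                  # one-step regression update
```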

23 pages, 1318 KB  
Review
A Systematic Study on Reinforcement Learning Based Applications
by Keerthana Sivamayil, Elakkiya Rajasekar, Belqasem Aljafari, Srete Nikolovski, Subramaniyaswamy Vairavasundaram and Indragandhi Vairavasundaram
Energies 2023, 16(3), 1512; https://doi.org/10.3390/en16031512 - 3 Feb 2023
Cited by 113 | Viewed by 24206
Abstract
We have analyzed 127 publications for this review paper, which discuss applications of Reinforcement Learning (RL) in marketing, robotics, gaming, automated cars, natural language processing (NLP), internet of things security, recommendation systems, finance, and energy management. The optimization of energy use is critical in today’s environment. We mainly focus on the RL application for energy management. Traditional rule-based systems have a set of predefined rules. As a result, they may become rigid and unable to adjust to changing situations or unforeseen events. RL can overcome these drawbacks. RL learns by exploring the environment randomly and, based on experience, it continues to expand its knowledge. Many researchers are working on RL-based energy management systems (EMS). RL is utilized in energy applications such as optimizing energy use in smart buildings, hybrid automobiles, smart grids, and managing renewable energy resources. RL-based energy management in renewable energy contributes to achieving net zero carbon emissions and a sustainable environment. In the context of energy management technology, RL can be utilized to optimize the regulation of energy systems, such as building heating, ventilation, and air conditioning (HVAC) systems, to reduce energy consumption while maintaining a comfortable atmosphere. EMS can be accomplished by teaching an RL agent to make judgments based on sensor data, such as temperature and occupancy, to modify the HVAC system settings. RL has proven beneficial in lowering energy usage in buildings and is an active research area in smart buildings. RL can be used to optimize energy management in hybrid electric vehicles (HEVs) by learning an optimal control policy to maximize battery life and fuel efficiency. RL has acquired a remarkable position in robotics, automated cars, and gaming applications. The majority of security-related applications operate in a simulated environment. RL-based recommender systems provide good recommendation accuracy and diversity. This article assists the novice in comprehending the foundations of reinforcement learning and its applications.
