Article

Energy-Efficient and Maintenance-Aware Control of a Residential Split-Type Air Conditioner Using an Enhanced Deep Q-Network

by
Natdanai Kiewwath
1,
Pattaraporn Khuwuthyakorn
2,3 and
Orawit Thinnukool
2,3,*
1
Division of Knowledge and Innovation Management, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai 50200, Thailand
2
Division of Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai 50200, Thailand
3
Innovative Research and Computational Science Lab, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai 50200, Thailand
*
Author to whom correspondence should be addressed.
Sustainability 2026, 18(7), 3578; https://doi.org/10.3390/su18073578
Submission received: 5 March 2026 / Revised: 31 March 2026 / Accepted: 2 April 2026 / Published: 6 April 2026
(This article belongs to the Special Issue AI in Smart Cities and Urban Mobility)

Abstract

Residential air conditioning systems are a major contributor to household electricity consumption in tropical regions, where environmental factors such as climate variability and particulate pollution (PM10) can further increase cooling demand and accelerate equipment degradation. This study proposes an Enhanced Deep Q-Network (Enhanced DQN) for energy-efficient and maintenance-aware control of residential split-type air conditioners under dynamic environmental conditions. The proposed method integrates several stability-oriented reinforcement learning mechanisms, including Double Q-learning, a dueling architecture, prioritized experience replay, multi-step returns, Bayesian-style regularization via Monte Carlo dropout, and entropy-aware exploration. The framework is evaluated through a two-stage process consisting of a diagnostic benchmark on LunarLander-v3 to assess learning stability, followed by a realistic 365-day simulation driven by Thai weather and PM10 data. Compared with a fixed 25 °C baseline, the proposed controller reduced annual electricity consumption from 5116.22 kWh to as low as 4440.03 kWh, corresponding to a saving of 13.22%. The learned policy also exhibited environmentally adaptive behavior under high PM10 conditions, indicating maintenance-aware characteristics. These findings demonstrate that reinforcement learning can provide robust, adaptive, and sustainable control strategies for residential cooling systems in tropical environments.

1. Introduction

Over the past several decades, global electricity demand has increased continuously. This growing demand for electricity arises from several factors, including economic expansion, industrial growth, the increase in population, and urbanization [1]. In the building sector, electricity demand has also been increasing steadily, contributing a substantial share of final energy consumption and energy-related CO2 emissions [2]. A report by the International Energy Agency (IEA) states that electricity use for space cooling by air conditioning systems in buildings accounts for about 20% of total building electricity consumption, and this share is expected to keep increasing as global average temperatures rise and average incomes grow, enabling more households to purchase and use air conditioners [3].
For Thailand, which is located in a tropical region, space cooling in residential buildings is essential. In the household sector, split-type air conditioners are very popular because they are relatively inexpensive and easy to install. However, a survey of residential electricity consumption in Thailand in 2018 reported that the total residential electricity consumption was 35,624 GWh, of which air conditioners accounted for 26.50%, followed by refrigerators at 19.33% [4]. In addition, national energy statistics report that residential electricity consumption has increased continuously from 53,747 GWh in 2022 to 57,726 GWh in 2023, and in 2024 residential electricity consumption increased by 7.7% compared to the previous year. These reports conclude that electricity demand in the residential sector is likely to keep increasing due to urban expansion and hotter climatic conditions [5,6]. Efficient operation and proper maintenance of air conditioners can reduce electricity consumption and greenhouse gas emissions, which aligns with the Sustainable Development Goals (SDGs), in particular SDG 7 on affordable and clean energy and SDG 13 on climate action [7,8].
In practice, the performance and efficiency of air conditioners do not depend only on the thermostat setpoint and the operating schedule. There are also external factors such as global warming, which increases global temperature trends, climate variability, and air pollution from fine particulate matter such as PM10. These external factors affect the performance of air conditioners in several ways. Higher temperatures and climate variability increase cooling demand, while particulate pollution can promote dust accumulation on air filters and heat exchanger surfaces, which may reduce operating efficiency and increase cleaning and maintenance requirements over time [9,10].
In conventional practice, air conditioners are usually operated with a fixed temperature setpoint and scheduled operating hours based on government energy saving campaigns, which primarily focus on user thermal comfort. For maintenance, preventive maintenance is typically performed according to a predefined schedule, and if the air conditioner fails, a technician is called to repair it, which is corrective maintenance [11]. These traditional approaches cannot adequately cope with dynamically changing weather conditions, temporal variations in fine particulate pollution, or seasonal differences in climate. They also do not explicitly adapt to seasonal conditions while simultaneously considering user comfort, energy savings, and long-term maintenance of air conditioner components. Thermal comfort in residential environments is commonly discussed with reference to ASHRAE-55-based studies and tropical residential field evidence, which often indicate acceptable or neutral comfort temperatures in the mid-20 °C range [12,13]. These limitations motivate the need for energy-efficient and maintenance-aware control policies that can adapt to continuously changing environmental conditions, respond to seasonal dust levels, and support maintenance activities so that the air conditioner can operate for a longer lifetime.
Reinforcement Learning (RL) is a principled framework for sequential decision-making under uncertainty [14]. In an RL formulation, an agent interacts with a stochastic environment. In each state, the agent selects an action and then receives a reward from that action. The agent therefore learns to make decisions under different situations to obtain the best long-term outcome. Problems that use reinforcement learning are typically formulated as a Markov Decision Process (MDP). The main objective is to learn a policy that maximizes the expected discounted return over the long run. Q-learning, or classical tabular Q-learning, can be regarded as a stochastic approximation scheme for solving the Bellman optimality equation, and it is well known that it converges to the optimal action value function under standard conditions on the learning rate, exploration, and the stationarity of the Markov decision process [15]. However, when Q-learning is combined with nonlinear function approximation, bootstrapping, and off-policy learning, these three elements together create what is known as the “deadly triad”, which can lead to instability and divergence [14].
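The tabular Q-learning update described above can be sketched as a single stochastic approximation step toward the Bellman optimality target; the grid size, learning rate, and reward values below are illustrative assumptions, not values used in this study.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step toward the Bellman optimality target."""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrapped target
    td_error = td_target - Q[s, a]              # temporal-difference (TD) error
    Q[s, a] += alpha * td_error                 # stochastic approximation step
    return Q, td_error

# Tiny illustration: 2 states, 2 actions, all Q-values initialized to zero.
Q = np.zeros((2, 2))
Q, err = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
# With Q initialized to zero, the target equals r = 1.0, so Q[0, 1] becomes 0.1.
```

Under the standard conditions cited above (decaying learning rate, sufficient exploration, stationary MDP), repeated application of this update converges to the optimal action-value function.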
The Deep Q-Network (DQN) architecture mitigates some of these issues by using a deep neural network to approximate the action value function, together with experience replay and a target network to stabilize training [16]. DQN has achieved human level performance on Atari game benchmarks, but it still suffers from well-known limitations such as overestimation bias in Q-values, sensitivity to hyperparameters, and brittle exploration based on ε-greedy policies. To address these weaknesses, several extensions have been proposed. Double Q-learning decouples action selection from target evaluation to reduce overestimation bias [17]; dueling network architectures separate state-value and advantage estimation, improving representation learning in states where many actions have similar returns [18]; prioritized experience replay (PER) samples transitions with large temporal-difference (TD) errors more frequently, accelerating the propagation of informative updates [19]; and multi-step returns incorporate rewards from several future steps, improving credit assignment in long horizon tasks [14]. In addition, Bayesian regularization can be applied to neural network parameters to penalize overly complex models and reduce overfitting, which is particularly important when training deep value functions on non-stationary or noisy data.
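As a concrete illustration of two of these extensions, the sketch below combines the Double DQN target (the online network selects the action, the target network evaluates it) with a multi-step return; the Q-value arrays and reward sequence are illustrative assumptions.

```python
import numpy as np

def double_dqn_target(q_online_next, q_target_next, reward, gamma=0.99, done=False):
    """Double DQN: select argmax with the online net, evaluate it with the target net."""
    a_star = int(np.argmax(q_online_next))   # action selection (online network)
    bootstrap = q_target_next[a_star]        # action evaluation (target network)
    return reward + (0.0 if done else gamma * bootstrap)

def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Multi-step return: discounted sum of n rewards plus a bootstrapped tail."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g + (gamma ** len(rewards)) * bootstrap_value

# The online net prefers action 0, which the target net scores at 2.0:
y = double_dqn_target(np.array([3.0, 1.0]), np.array([2.0, 5.0]), reward=1.0)
# y = 1.0 + 0.99 * 2.0 = 2.98, not 1.0 + 0.99 * 5.0 as a plain max would give,
# which is precisely how decoupling reduces overestimation bias.
g3 = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0)
```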
Another line of research introduces entropy-based regularization into the RL objective. In maximum-entropy RL and soft Q-learning, the agent seeks to maximize not only the expected return but also the expected policy entropy, leading to “soft” Bellman operators and stochastic policies that are both high-performing and exploratory [20]. This perspective underlies Soft Actor-Critic and related algorithms, which have demonstrated improved robustness and sample efficiency in continuous control [21]. When entropy regularization is incorporated into value-based methods, it can be interpreted as a mechanism for stabilizing exploration and avoiding premature convergence to deterministic policies, especially in multi objective environments where several actions may have comparable returns. In the context of air conditioning control, entropy-based exploration control is appealing because it can encourage diverse operating patterns during training, while Bayesian regularization and multi-step returns support more stable learning of long-horizon energy and maintenance trade-offs.
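A minimal way to realize entropy-aware exploration in a value-based agent is a Boltzmann (softmax) policy over Q-values, whose temperature controls the policy entropy; the Q-values and temperatures below are illustrative assumptions, not parameters of the proposed controller.

```python
import numpy as np

def boltzmann_policy(q_values, temperature=1.0):
    """Softmax over Q-values: higher temperature yields a higher-entropy policy."""
    z = q_values / temperature
    z = z - z.max()                   # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def policy_entropy(p):
    """Shannon entropy of the action distribution (in nats)."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

q = np.array([1.0, 1.2, 0.8])
p_hot = boltzmann_policy(q, temperature=5.0)    # near-uniform, exploratory
p_cold = boltzmann_policy(q, temperature=0.05)  # near-greedy, almost deterministic
# Annealing the temperature gradually reduces entropy toward a greedy policy.
```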
Deep RL has been applied increasingly to building and HVAC control, with studies showing that well-designed agents can reduce energy consumption while maintaining thermal comfort in both simulated and real-world environments [22,23,24]. Existing work includes applications to chiller plants, variable air volume systems, and home energy management, often reporting significant energy savings relative to rule-based baselines. However, relatively few studies focus specifically on residential split-type air conditioners in hot and humid climates, and most prior work primarily targets energy efficiency and comfort without explicitly modeling the interaction between control behavior, dust related performance degradation, and maintenance-awareness. Moreover, many implementations rely on standard DQN variants without systematically combining multiple stability-enhancing mechanisms or explicitly integrating Bayesian and entropy regularization into the learning process.
To address these gaps, this paper proposes an Enhanced Deep Q-Network (Enhanced DQN) for energy-efficient and maintenance-aware control of residential split-type air conditioners. The proposed algorithm integrates Double Q-learning, dueling network architecture, prioritized experience replay, multi-step returns, Bayesian regularization, and entropy-based exploration control into a unified value-based RL framework [16,17,18,19,20]. From a design perspective, these components are chosen to improve learning stability, support multi-objective operation, and make the agent suitable for long horizon tasks in dynamically changing environments. The Enhanced DQN is first evaluated through a diagnostic benchmark on the LunarLander-v3 environment using multiple random seeds to analyze convergence behavior and policy variance. It is then subjected to an applied evaluation by simulating the control of a 15,000 BTU residential split-type air conditioner, operating hourly for 365 days under environmental conditions that include both Thailand’s seasonal weather and varying PM10 dust levels.
The experimental results show that the proposed control approach can reduce annual electricity consumption by up to 13.22% compared with a constant temperature setting of 25 °C, demonstrating tangible energy savings relative to a widely recommended reference setpoint. In addition, the learned policy exhibits maintenance-aware control behavior, adjusting operating patterns in response to dust conditions and implicitly supporting better maintenance planning. These findings highlight the potential of reinforcement learning to promote intelligent energy management and sustainable maintenance for residential cooling systems, in line with SDG 7 and SDG 13 [7,8].
The main contributions of this paper are threefold:
  • We formulate residential split-type air conditioner operations in a tropical country as an energy-efficient and maintenance-aware RL problem that explicitly accounts for seasonal weather variation, PM10 dust levels, and long-horizon interactions between control actions, energy use, and maintenance-related behavior.
  • We develop an Enhanced DQN that combines Double Q-learning, dueling network architecture, prioritized experience replay, multi-step returns, Bayesian regularization, and entropy-based exploration control into a unified stability-oriented framework suitable for multi-objective, long-horizon control.
  • We conduct a two-stage evaluation involving a diagnostic benchmark on LunarLander-v3 using multiple random seeds to study learning stability and policy variance, followed by a 365-day simulation of a 15,000 BTU residential split-type air conditioner under realistic Thai weather and PM10 conditions, demonstrating annual energy savings of approximately 13.2% relative to a fixed 25 °C baseline.
The remainder of this paper is organized as follows. Section 2 reviews related work on deep RL for building and HVAC control and on stability-enhancing extensions of DQN. Section 3 presents the problem formulation, system model, and RL setup for residential split-type air conditioner control. Section 4 describes the proposed Enhanced DQN architecture and its Bayesian and entropy regularization mechanisms. Section 5 details the experimental design and reports the results for both the LunarLander-v3 and air-conditioning simulations. Section 6 discusses the implications, limitations, and potential extensions of this work. Finally, Section 7 concludes the paper and outlines directions for future research.

2. Related Work

2.1. Global Cooling Demand, SDGs, Split-Type Air Conditioners, and Smart Buildings

As discussed in Section 1, electricity use in the building sector has been increasing steadily and already accounts for a substantial share of final energy consumption and energy related CO2 emissions worldwide [1,2]. According to the International Energy Agency (IEA), space cooling is one of the fastest growing end uses of electricity. Air conditioners and electric fans together consume roughly one fifth of all electricity used in buildings, and cooling demand could increase two- to three-fold by 2050 in the absence of additional efficiency measures [1,3]. This trend poses a direct challenge to the Sustainable Development Goals (SDGs), particularly SDG 7 on affordable and clean energy and SDG 13 on climate action, because increasing peak electricity demand from air conditioning requires additional investment in power generation and network infrastructure and contributes to higher greenhouse gas emissions [7,8].
In tropical countries such as Thailand, residential space cooling is essential, and split-type air conditioners are widely adopted because they are relatively inexpensive and easy to install. National surveys indicate that air conditioners account for the largest share of household electricity use, and residential demand has continued to rise with urbanization, higher incomes, and warmer climatic conditions [4,5,6]. In addition, external stressors such as climate variability and particulate pollution can degrade air conditioner performance. Higher ambient temperatures increase cooling loads, while particulate matter accelerates filter and coil fouling, which reduces airflow and heat transfer and forces fans and compressors to work harder [9,10].
In practice, residential air conditioners are still commonly operated using simple strategies, such as fixed temperature setpoints recommended by government energy-saving campaigns and coarse on/off schedules. Maintenance is typically performed using time-based preventive routines, for example, periodic cleaning every few months, while corrective maintenance is carried out when failures occur [11]. Such traditional strategies are not designed to cope with dynamically changing weather, seasonal PM10 patterns, or long-horizon interactions between control actions, energy use, and equipment degradation. These limitations motivate energy-efficient and maintenance-aware control policies that adapt to environmental dynamics and support more proactive maintenance.
In parallel, smart buildings have emerged as an important response to these challenges, integrating intelligent energy management, automated building systems, and IoT-based sensing to support adaptive control of HVAC, lighting, and other building services [25]. In this context, HVAC systems are often regarded as the energy heart of the building, as they represent a dominant controllable load that directly affects comfort and electricity costs [26]. Smart building and smart home platforms rely on IoT devices and sensors to measure variables such as temperature, humidity, indoor air quality, occupancy, and equipment states, thereby providing the data needed for real-time control and optimization [27].
Building Management Systems (BMS) supervise and control subsystems in commercial and institutional buildings by combining sensor data with algorithms for setpoint adjustment, fault detection, and alarm generation [28]. At the residential scale, energy-efficient smart home systems emphasize decision-making processes that reduce energy use and costs while maintaining comfort and responding to occupant behavior [29]. In this study, these developments matter for two reasons. First, they provide sensing and communication infrastructure to observe the operation of individual split-type air conditioners in real time. Second, they enable tighter integration of reinforcement learning with predictive and prescriptive maintenance by treating operational signals as both control inputs and health indicators for maintenance analytics.

2.2. Predictive and Prescriptive Maintenance for HVAC and Cooling Systems

Maintenance is a critical function in asset life-cycle management because it affects availability, reliability, and total life-cycle cost. Manzini et al. define maintenance as a set of technical, administrative, and managerial actions performed throughout an asset’s life to preserve it in, or restore it to, a state in which it can perform its required function [30]. Maintenance strategies range from reactive run-to-failure approaches to time-based preventive maintenance, and further to condition-based maintenance, predictive maintenance (PdM), and prescriptive maintenance.
In the HVAC domain, the literature on PdM has grown rapidly. Systematic reviews such as that of Zonta et al. highlight the multidisciplinary nature of PdM in the Industry 4.0 context, spanning data acquisition, feature extraction, modelling, decision support, and business integration [31]. Essakali et al. surveyed PdM algorithms applied to HVAC systems, covering air handling units (AHUs), chiller plants, cooling towers, and duct systems, and discussed approaches ranging from signal analysis and health-index construction to machine learning and deep-learning methods for fault detection and remaining useful life (RUL) prediction. Other studies develop data-driven PdM and fault-detection frameworks for whole-building HVAC systems using real sensor data, with the aim of improving energy efficiency while scheduling maintenance more effectively [26,30]. Zonta et al. proposed a taxonomy that classifies PdM approaches by model type, including regression models for RUL estimation, classification models for health-state or fault-type prediction, and survival models for failure probability over time, as well as by modelling paradigm, namely physics-based, knowledge-based, data-driven, and hybrid approaches [31]. The emerging consensus is that future PdM systems will be increasingly hybrid, combining data-driven models with domain knowledge to achieve robust performance under real-world constraints.
Beyond PdM, prescriptive maintenance (PsM) has been proposed as the most advanced stage of knowledge-based maintenance. Instead of stopping at the question of when the asset will fail, PsM explicitly addresses what should be done, when, and how, in order to optimize cost, risk, and system performance. Bokrantz et al. analyzed the role of maintenance in digitalized manufacturing and developed Delphi-based scenarios for maintenance in 2030, linking PsM to Industry 4.0 and to horizontally and vertically integrated data flows in factories [32]. Khoshafian and Rostetter introduced the notion of digital prescriptive maintenance, emphasizing the integration of IoT, the Process of Everything, and business process management (BPM) to connect sensor data with automated maintenance decision-making [33]. Ansari et al. proposed the PriMa framework as a prescriptive maintenance model for cyber-physical production systems, explicitly linking sensor data, models, and knowledge bases to prescriptive maintenance planning [34].
Artificial intelligence (AI) provides an overarching framework for methods that enable machines to exhibit human-like capabilities such as reasoning, learning, and decision-making [35]. Within this framework, machine learning (ML) is often regarded as the core mechanism by which systems improve performance with experience; Mitchell defined machine learning as the study of algorithms that enable computer programs to improve their performance through experience [36]. Dasgupta and Nath classify ML algorithms into supervised, unsupervised, and reinforcement learning, based on the availability of labels and the nature of the learning objective [37].
In maintenance applications, ML has been used for continuous-variable prediction, classification of normal and faulty states, anomaly detection using unsupervised methods, and system-level cost or risk estimation using deep models. Nguyen et al. developed an AI-based maintenance decision-making and optimization framework for multi-state component systems, showing that artificial neural networks (ANNs) can accurately estimate maintenance costs at the system level and that multi-agent deep RL can further improve decision efficiency [38].
Reinforcement learning has attracted growing interest in maintenance planning and operation. Garcia and Rachelson provide a comprehensive overview of Markov decision processes (MDPs) as the mathematical foundation for sequential decision-making under uncertainty, while Sutton and Barto present RL methods ranging from tabular algorithms to deep RL [14]. In a recent review, Ogunfowora and Najjaran survey reinforcement learning and deep reinforcement learning-based solutions for machine maintenance planning, scheduling policies, and optimization, reporting that roughly 70% of the surveyed work uses Q-learning or DQN variants, which reflects the model-free nature of many maintenance problems [39]. Rocchetta et al. proposed a reinforcement learning framework for optimal operation and maintenance of power grids, using prognostics and health management (PHM) information to support joint operation and maintenance decisions and to maximize expected profit under environmental uncertainty [40].
Despite these advances, most PdM/PsM and RL-based maintenance frameworks focus on industrial machinery or large HVAC plants in commercial buildings. The control layer (which governs equipment operation and energy use) and the maintenance planning layer are usually treated separately: controllers aim to track loads or minimize energy, while PdM/PsM models use health indicators and failure probabilities to plan interventions. There is comparatively little work on maintenance-aware controllers whose objective functions natively combine energy, comfort, and maintenance stability, for example by limiting compressor start–stop frequency or adapting control patterns to dust-related performance degradation.
In the Thai context, the authors previously proposed a reinforcement learning-based prescriptive maintenance model for air-conditioning systems which formulated an RL framework to link operational data from air conditioners to maintenance decisions at the policy level [41]. Another study applied machine learning approaches to energy forecasting and management for a university building in Chiang Mai, demonstrating that data-driven models can support building-level energy management decisions effectively [42]. However, these earlier works focused on forecasting and policy-level maintenance planning. The present study builds on this line of research by designing an Enhanced DQN that is maintenance-aware by construction and by embedding maintenance-related stability metrics directly into the reward function of an RL controller for residential split-type air conditioners.

2.3. RL Theory, the Deadly Triad, and the Deep Q-Network Family

From a theoretical perspective, reinforcement learning problems are typically formulated as Markov decision processes (MDPs), where an agent interacts with a stochastic environment by observing states, selecting actions, and receiving rewards, with the aim of maximizing long-term discounted return. Classical tabular Q-learning can be viewed as a stochastic approximation scheme for solving the Bellman optimality equation, and it is well known to converge to the optimal action-value function under standard conditions on the learning rate, exploration, and stationarity of the MDP [14,15].
When Q-learning is combined with nonlinear function approximation, bootstrapping, and off-policy learning, the three elements together form the so-called deadly triad, which can cause instability or even divergence in value estimates [14]. Mnih et al. proposed the Deep Q-Network (DQN) architecture to mitigate these issues by using a deep neural network to approximate the Q-function together with an experience replay buffer and a target network to reduce correlation in training data and stabilize parameter updates [16]. Subsequent work introduced several extensions. Double DQN addresses overestimation bias by decoupling action selection and action evaluation in the target update [17]. Dueling network architectures separate state-value and advantage streams, improving value estimation in states where many actions have similar returns [18]. Prioritized experience replay (PER) samples transitions in proportion to their temporal-difference (TD) error, emphasizing informative samples and improving sample efficiency [19]. These components form a family of increasingly stable DQN variants that have been applied in various control domains, including energy and HVAC applications [22,23,24,39,40].
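The proportional prioritization scheme of PER can be sketched as follows, together with the importance-sampling weights that correct the bias introduced by non-uniform sampling; the priority exponent, buffer contents, and batch size are illustrative assumptions.

```python
import numpy as np

def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities proportional to |TD error|^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

def importance_weights(probs, beta=0.4):
    """Importance-sampling weights that compensate for prioritized sampling."""
    n = len(probs)
    w = (n * probs) ** (-beta)
    return w / w.max()               # normalize by the max weight for stability

td = np.array([0.1, 2.0, 0.5, 0.05])
p = per_probabilities(td)            # the transition with |TD| = 2.0 dominates
w = importance_weights(p)            # ...and correspondingly gets the smallest weight
rng = np.random.default_rng(0)
batch = rng.choice(len(td), size=2, p=p, replace=False)
```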
In the present work, these ideas are combined in an Enhanced DQN that integrates Double Q-learning, dueling network architecture, PER, and multi-step returns into a single value-based RL agent tailored to long-horizon, multi-objective control of split-type air conditioners.

2.4. Maximum-Entropy RL and Entropy Regularization in Q-Learning

Maximum-entropy reinforcement learning extends the traditional RL objective by augmenting the expected return with an entropy term that encourages stochastic policies with sufficient exploration [20]. This leads to soft Bellman operators and soft Q-functions, where the value of a state–action pair reflects both its expected return and the entropy of the resulting policy. Algorithms such as Soft Q-learning and Soft Actor–Critic (SAC) have demonstrated that entropy regularization can substantially improve learning stability and sample efficiency in a wide range of continuous-control tasks [20,21]. SAC has become a widely adopted baseline in robotics and energy-related control problems.
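The soft value function underlying these methods replaces the hard max over actions with a temperature-scaled log-sum-exp; a minimal numerical sketch, with illustrative Q-values:

```python
import numpy as np

def soft_value(q_values, alpha=1.0):
    """Soft state value V(s) = alpha * log(sum_a exp(Q(s,a)/alpha)).
    As alpha -> 0 this recovers max_a Q(s,a), i.e., the hard Bellman backup."""
    z = q_values / alpha
    m = z.max()
    return float(alpha * (m + np.log(np.exp(z - m).sum())))  # stable log-sum-exp

q = np.array([1.0, 2.0, 0.5])
v_soft = soft_value(q, alpha=1.0)    # strictly greater than max(q) = 2.0
v_hard = soft_value(q, alpha=1e-3)   # approximately max(q) = 2.0
```

The gap between the soft and hard values reflects the entropy bonus that keeps the induced Boltzmann policy stochastic rather than deterministic.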
Recent analytical work has linked non-uniform sampling schemes and entropy regularization to stochastic approximation theory, clarifying conditions under which the choice of sampling distribution and entropy weight preserves or improves convergence properties of Q-learning like algorithms. Although many practical applications of maximum entropy RL focus on actor–critic methods with continuous action spaces, the underlying ideas can also be adapted to value-based methods by incorporating entropy terms into the value function and the policy.
In the context of air-conditioning control, entropy-based exploration control is appealing because it can encourage the agent to explore diverse operating patterns during training, rather than prematurely collapsing to a brittle deterministic policy. Combined with Bayesian regularization of network parameters, which penalizes overly complex models and mitigates overfitting in non-stationary or noisy environments, and with multi-step returns, which improve long-horizon credit assignment, these mechanisms provide a stability-oriented design that is well suited to controlling split-type air conditioners over a full year of operation under varying weather and PM10 conditions.
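Monte Carlo dropout, named in the abstract as the Bayesian-style regularization mechanism, can be sketched as averaging Q-estimates over stochastic dropout masks that remain active at inference time; the network sizes, weights, and dropout rate below are illustrative assumptions, not the architecture used in this study.

```python
import numpy as np

rng = np.random.default_rng(42)
W1 = rng.normal(size=(4, 16))        # toy hidden layer weights
W2 = rng.normal(size=(16, 3))        # toy output layer weights (3 actions)

def q_forward(state, drop_p=0.2):
    """One stochastic forward pass: dropout stays ON to sample from the
    approximate posterior over network functions."""
    h = np.maximum(0.0, state @ W1)                  # ReLU hidden layer
    mask = rng.random(h.shape) >= drop_p             # Bernoulli dropout mask
    h = h * mask / (1.0 - drop_p)                    # inverted-dropout scaling
    return h @ W2

def mc_dropout_q(state, n_samples=50):
    """Mean and std of Q-values over repeated stochastic passes;
    the std serves as a simple proxy for epistemic uncertainty."""
    samples = np.stack([q_forward(state) for _ in range(n_samples)])
    return samples.mean(axis=0), samples.std(axis=0)

state = np.array([0.5, -0.2, 1.0, 0.3])
q_mean, q_std = mc_dropout_q(state)
```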

2.5. Deep RL for Building and HVAC Control and Remaining Gaps

Deep RL has been increasingly applied to building energy management and HVAC control. Recent reviews by Yu et al. [22], Fu et al. [23] and Wang et al. [24] summarize a broad range of RL and deep RL applications, including temperature control, chiller plant optimization, variable air volume (VAV) systems, demand response and home energy management systems (HEMS), in both simulation and real-world buildings. These studies consistently report that RL-based controllers can reduce energy consumption relative to rule-based or, in some cases, model predictive control (MPC) baselines, while maintaining or improving occupant comfort. Several previous studies have represented thermal comfort using simplified temperature-based criteria, typically expressed in terms of fixed or dynamic temperature setpoints, setpoint-tracking performance, or broad acceptable indoor temperature ranges rather than occupant-specific comfort models [43,44,45]. Such temperature ranges are often selected with reference to established thermal comfort guidelines, including ASHRAE Standard 55 [12,13].
At the level of concrete case studies, Wang et al. applied DQN-based controllers to smart-building HVAC systems and showed that learning from operational data can reduce energy costs while maintaining comfort [24]. Dong et al. reviewed smart building sensing systems for indoor environment control and highlighted the importance of sensor-based monitoring infrastructure in enabling adaptive HVAC operation and energy-efficient building management [27]. Geraldo Filho et al. reviewed energy-efficient smart home systems and highlighted that both infrastructure design and decision-making processes play essential roles in enabling adaptive and energy-efficient operation in residential environments [29].
Nevertheless, most of the existing literature focuses on large systems—such as chiller plants, VAV systems, or whole-building HVAC—or on demand response and load management, rather than on individual residential split-type air conditioners in hot and humid climates. Moreover, relatively few studies formulate the control problem as a multi-objective task that explicitly emphasizes maintenance stability, such as limiting compressor cycling or incorporating simple proxies of equipment degradation linked to dust accumulation and air quality.
In the Thai context, the authors have previously investigated machine learning-based energy forecasting and management for a university building in Chiang Mai, showing that data-driven models can support building-level energy management decisions [42]. However, that work focused primarily on forecasting and analytical energy management, rather than direct RL-based control of individual devices.
The present study addresses these gaps along two dimensions: First, at the application level, by focusing on a residential split-type air conditioner in a tropical climate, modelling seasonal weather and PM10 dust levels explicitly within the RL environment; and Second, at the algorithmic level, by proposing an Enhanced DQN that combines Double Q-learning, dueling networks, PER, multi-step returns, Bayesian regularization, and entropy-based exploration control into a unified stability-oriented framework. The agent is trained and evaluated over a 365-day control horizon, with a reward function that balances energy consumption, thermal comfort, and maintenance-related stability metrics, thereby positioning the work at the intersection of deep RL, smart building control, and prescriptive maintenance.
In terms of reported performance, several prior studies on RL-based HVAC control have demonstrated energy savings typically ranging from approximately 5% to 20%, depending on building type, control objectives, and evaluation conditions. While direct numerical comparison is difficult due to differences in system scale, environment modeling, and experimental protocols, these results provide a contextual reference for interpreting the performance of the proposed method.

3. Problem Formulation, System Model, and Reinforcement Learning Setup

This study considers long-horizon control of a residential split-type air conditioner operating under time-varying outdoor weather and particulate matter (PM10) exposure. The controller is designed to jointly reduce electricity consumption, maintain acceptable thermal comfort, and mitigate maintenance-related degradation over an annual operating cycle. To support decision-making under stochastic transitions and delayed operational consequences, we formulate the task as a reinforcement learning (RL) problem using a Markov decision process (MDP).

3.1. Markov Decision Process Formulation

The control task is modeled as an MDP defined by the tuple (S, A, P, R, γ), where S denotes the state space, A is a finite discrete action space, P(s_{t+1} | s_t, a_t) represents the transition dynamics, R denotes the reward function, and γ ∈ (0, 1] is the discount factor. At each decision step t, the agent observes s_t ∈ S, selects an action a_t ∈ A, receives a scalar reward r_t, and transitions to the next state s_{t+1}.
The objective is to maximize the expected discounted return:
J(π) = E_π[ Σ_{t=0}^{T−1} γ^t r_t ],
where T denotes the episode length and π(a | s) is the control policy. In this study, each episode corresponds to a full-year simulation with hourly resolution (T = 8760 time steps). This formulation enables the agent to learn long-horizon control policies under seasonal variability, time-varying environmental conditions, and delayed system responses. The exact horizon definition and implementation details are provided in Section 5 [14].

3.2. State Representation

The state representation is designed to capture the thermal environment, air-quality exposure, equipment condition, and temporal context relevant to air-conditioning control. In this study, the state vector consists of eight variables:
s_t = (T_indoor, H_indoor, T_outdoor, H_outdoor, PM10, F_health, h, d)
where
  • T_indoor: indoor air temperature (°C)
  • H_indoor: indoor relative humidity (%)
  • T_outdoor: outdoor air temperature (°C)
  • H_outdoor: outdoor relative humidity (%)
  • PM10: particulate matter concentration (µg/m3)
  • F_health: filter health index (normalized between 0 and 1)
  • h: hour of day (0–23)
  • d: day of year (1–365)
Indoor temperature is approximated using a simplified thermal response model driven by outdoor conditions and cooling actions. As a result, indoor temperature evolves dynamically rather than being fixed or directly equal to outdoor temperature. Indoor humidity is included as a state variable but is not directly penalized in the reward function, allowing the model to focus on temperature-based comfort control. The inclusion of PM10 and filter health enables the agent to account for environmental air quality and equipment degradation. Filter health is modeled as a simplified PM10-driven degradation process, where filter health decreases proportionally to PM10 concentration. This abstraction reflects the effect of dust accumulation on airflow reduction and increased system load. All continuous variables are normalized to the range [0, 1] using min–max scaling to improve learning stability and prevent dominance of variables with larger numerical scales.
This formulation should be interpreted as an approximate Markov representation under partial observability. While the true system may violate strict Markov assumptions due to unobserved factors such as occupancy and internal heat gains, the inclusion of temporal variables and observable thermal responses provides sufficient information for stable policy learning in practice.
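As a concrete illustration, the state construction and PM10-driven filter degradation described above can be sketched in a few lines of Python. The min–max bounds and the degradation rate constant below are assumptions for illustration only; the paper does not report the exact values used in its implementation.

```python
# Hypothetical min-max bounds for each state variable; illustrative only,
# since the exact ranges used in the implementation are not reported.
STATE_BOUNDS = {
    "T_indoor":  (16.0, 40.0),   # degC
    "H_indoor":  (0.0, 100.0),   # %
    "T_outdoor": (10.0, 45.0),   # degC
    "H_outdoor": (0.0, 100.0),   # %
    "PM10":      (0.0, 300.0),   # ug/m3
    "F_health":  (0.0, 1.0),     # already in [0, 1]
    "hour":      (0.0, 23.0),
    "day":       (1.0, 365.0),
}

def normalize_state(raw):
    """Min-max scale every state variable into [0, 1]; values are clipped
    to their bounds first so out-of-range readings cannot break scaling."""
    vec = []
    for key, (lo, hi) in STATE_BOUNDS.items():
        v = min(max(raw[key], lo), hi)
        vec.append((v - lo) / (hi - lo))
    return vec

def update_filter_health(f_health, pm10, k=1e-5):
    """PM10-proportional filter degradation proxy; k is an assumed rate
    constant, and health is clipped so it stays in [0, 1]."""
    return max(0.0, f_health - k * pm10)
```

The dictionary ordering fixes the position of each variable in the 8-dimensional observation vector, mirroring the state definition above.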

3.3. Discrete Action Space

To ensure compatibility with value-based deep reinforcement learning, the action space is defined as a discrete set of feasible air conditioner control commands. Each action corresponds to a composite operating configuration:
a_t = (vane, temp, fan, mode)
where
  • vane: airflow direction control
  • temp: temperature setpoint (mapped to 18–31 °C)
  • fan: fan speed level (low, medium, high)
  • mode: operation mode (5 discrete modes)
The resulting action space is implemented as a MultiDiscrete control structure with a total of 420 possible action combinations. This formulation corresponds to a discrete combination of control variables rather than a continuous control space. In the current implementation, the temperature setpoint and fan speed directly influence thermal dynamics and energy consumption, while the remaining dimensions are retained for extensibility and future system integration. Learning efficiency is supported through prioritized experience replay and multi-step returns, which mitigate potential redundancy effects.
Although some action dimensions are not strongly coupled to the current simplified dynamics, preliminary experiments indicated that the learning process remains stable without significant degradation in convergence behavior. This suggests that the adopted architecture and sampling mechanisms are sufficient to mitigate potential inefficiencies arising from action-space redundancy.
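The 420-way MultiDiscrete structure can be illustrated with a small decoding sketch. The per-dimension sizes below are inferred rather than quoted from the text: 18–31 °C yields 14 setpoints, and with 3 fan levels and 5 modes, 2 vane positions are consistent with the stated total of 2 × 14 × 3 × 5 = 420.

```python
import math

# Inferred dimension sizes (vane, temp index, fan, mode); only the total
# of 420 combinations, the 18-31 degC setpoint range, 3 fan levels, and
# 5 modes are stated explicitly in the text.
DIMS = [2, 14, 3, 5]

def decode_action(flat_index):
    """Unflatten a single integer into (vane, temp, fan, mode) indices."""
    parts = []
    for size in reversed(DIMS):
        parts.append(flat_index % size)
        flat_index //= size
    return tuple(reversed(parts))

def setpoint_celsius(temp_index):
    """Map temperature index 0..13 to the 18-31 degC setpoint range."""
    return 18 + temp_index
```

Decoding a flat index into its components in this way is equivalent to sampling directly from a Gymnasium `MultiDiscrete([2, 14, 3, 5])` space.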

3.4. Reward Scalarization for Multi-Objective Control

The HVAC control problem involves competing objectives, including reducing energy consumption, maintaining thermal comfort, and limiting maintenance-related degradation. A scalar reward function is therefore used to aggregate these objectives into a single learning signal:
r_t = −w_E E_t − w_C C_t − w_M M_t
where
  • E_t: energy consumption term
  • C_t = |T_indoor − T_target|: comfort penalty, defined as the absolute deviation between indoor temperature and the target setpoint
  • M_t = 1 − F_health: maintenance-related degradation penalty
The coefficients w_E, w_C, and w_M define the relative importance of each objective. The comfort term is based on temperature deviation from a target setpoint, reflecting common practice in residential cooling control. Humidity is not directly included in the reward function to maintain model simplicity and interpretability. The maintenance term is modeled using a filter degradation proxy rather than an explicit physical dust accumulation model. This abstraction captures the key effect of PM10 exposure on system performance, where higher particulate concentration leads to faster degradation, reduced airflow efficiency, and increased energy demand. To ensure meaningful trade-offs, all reward components are scaled to comparable magnitudes. Multiple reward-weight configurations are evaluated to analyze sensitivity and robustness of the learned policy.
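A minimal sketch of this scalarization, assuming unit weights for illustration (the actual RW1–RW4 weight values used in the experiments are given in Table 2):

```python
def scalarized_reward(energy, t_indoor, t_target, f_health,
                      w_E=1.0, w_C=1.0, w_M=1.0):
    """r_t = -w_E*E_t - w_C*C_t - w_M*M_t, where all three terms act as
    penalties. The default unit weights are illustrative only."""
    C_t = abs(t_indoor - t_target)   # comfort penalty (temperature deviation)
    M_t = 1.0 - f_health             # maintenance penalty (degradation proxy)
    return -(w_E * energy + w_C * C_t + w_M * M_t)
```

Because all three components enter with a negative sign, maximizing the scalar reward simultaneously pushes energy use, comfort deviation, and degradation toward zero.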

3.5. Decision Interval and Episode Definition

The controller operates at a fixed hourly decision interval. Each episode represents a full-year operating cycle consisting of 8760 time steps. This long-horizon formulation enables the agent to learn policies that account for:
  • seasonal weather variability
  • time-varying PM10 concentration
  • cumulative degradation effects
The full specification of environmental inputs, simulation assumptions, and evaluation protocol is described in Section 5. This formulation provides a practical and computationally efficient framework for learning adaptive air-conditioning control policies under realistic environmental conditions.

4. Proposed Enhanced DQN Architecture and Learning Components

This section introduces the proposed Enhanced Deep Q-Network (Enhanced DQN), a stability-oriented, value-based reinforcement learning algorithm tailored for long-horizon control under non-stationary disturbances. The method combines complementary learning components to reduce maximization bias, improve representation and sample efficiency, and promote sustained exploration.

4.1. Overview of Enhanced DQN

Enhanced DQN retains the standard DQN framework, which learns an action–value function using a neural network, but augments it with Double DQN target estimation, a dueling value–advantage decomposition, prioritized experience replay, multi-step returns, Monte Carlo (MC) dropout as Bayesian-style regularization, and entropy-aware exploration. Together, these mechanisms improve robustness to stochastic transitions and shifting operating contexts.

4.2. Action-Value Learning and Bellman Optimality

The action–value function under policy π is defined as:
Q^π(s, a) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s, a_t = a ].
The optimal action–value function satisfies the Bellman optimality equation:
Q*(s, a) = E[ r_t + γ max_{a′} Q*(s_{t+1}, a′) | s_t = s, a_t = a ].
A deep neural network approximates the action–value function and is trained by minimizing the temporal-difference (TD) error using experience replay and a separate target network [16].

4.3. Double DQN Target Estimation and Dueling Architecture

To reduce overestimation that arises from max-based bootstrapping, Double DQN decouples action selection and evaluation: the online network selects the maximizing action, while the target network evaluates that action when computing the TD target [17].
The dueling architecture decomposes the Q-function into a state-value stream and an advantage stream. This decomposition can improve representation efficiency, particularly in states where many actions yield similar value [18].
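Both mechanisms can be sketched together in a few lines; this is an illustrative fragment under simplified assumptions, not the authors' implementation:

```python
def dueling_q(value, advantages):
    """Dueling combination with the usual identifiability constraint:
    Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]

def double_dqn_target(reward, next_q_online, next_q_target,
                      gamma=0.99, done=False):
    """Double DQN target: the online network selects the argmax action,
    while the target network evaluates that action."""
    if done:
        return reward
    a_star = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[a_star]
```

Note how the target uses `next_q_online` only to pick the action and `next_q_target` only to score it, which is what breaks the max-based overestimation loop.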

4.4. Prioritized Experience Replay

Prioritized experience replay (PER) increases sample efficiency by sampling transitions in proportion to their TD error magnitude. Importance-sampling weights are used to correct the bias introduced by non-uniform sampling [19].
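A proportional-prioritization sketch with importance-sampling correction follows; the α and β values are conventional defaults from the PER literature, not values quoted from this paper.

```python
import random

def per_sample(priorities, batch_size, alpha=0.6, beta=0.4):
    """Sample transition indices with probability proportional to
    priority^alpha and return max-normalized importance-sampling weights
    that correct for the non-uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    probs = [s / total for s in scaled]
    n = len(priorities)
    indices = random.choices(range(n), weights=probs, k=batch_size)
    # Normalize by the maximum possible weight so all weights lie in (0, 1].
    w_max = max((1.0 / (n * p)) ** beta for p in probs)
    weights = [((1.0 / (n * probs[i])) ** beta) / w_max for i in indices]
    return indices, weights
```

With uniform priorities this degenerates to uniform sampling with unit weights, which is a useful sanity check for any PER implementation.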

4.5. Multi-Step Returns

To improve credit assignment in long-horizon tasks, Enhanced DQN uses multi-step returns. Compared with one-step bootstrapping, the n-step target propagates delayed consequences more effectively through the value function and can reduce reliance on inaccurate bootstrapped estimates early in training [14,46].
A typical n-step target can be written as
G_t^{(n)} = Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n max_{a′} Q_target(s_{t+n}, a′).
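For example, the n-step target can be computed as follows (an illustrative helper; the discount factor and reward values in the usage note are chosen arbitrarily):

```python
def n_step_target(rewards, next_q_target, gamma=0.99):
    """G_t^(n): discounted sum of the next n rewards plus the bootstrapped
    max Q-value from the target network at state s_{t+n}."""
    n = len(rewards)
    discounted = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return discounted + (gamma ** n) * max(next_q_target)
```

With n = 1 this reduces exactly to the familiar one-step TD target, so the same helper covers both cases.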

4.6. Bayesian-Style Regularization via Monte Carlo Dropout

To avoid overconfident value estimates under uncertainty, MC dropout is used as a Bayesian-style regularizer. Dropout remains active during training, and multiple stochastic forward passes are aggregated to approximate predictive uncertainty, which can stabilize temporal-difference updates in noisy settings [47,48]. The dropout rate (p = 0.1) was selected based on common practice in Bayesian deep learning, where moderate dropout provides effective regularization without significantly degrading predictive stability. Preliminary experiments confirmed that this value improves robustness of value estimation without introducing excessive variance. A full hyperparameter search was not performed due to computational constraints.
A simple aggregation form is
Q̄(s, a) = (1/M) Σ_{m=1}^{M} Q_{θ,m}(s, a),
where Q_{θ,m} denotes the Q-value from the m-th stochastic forward pass with dropout enabled.
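The aggregation is a simple per-action mean over the M stochastic passes; a minimal sketch:

```python
def mc_dropout_mean_q(forward_passes):
    """Average per-action Q-values over M stochastic forward passes made
    with dropout kept active (forward_passes: M lists of Q(s, a) values,
    one list per pass)."""
    m = len(forward_passes)
    num_actions = len(forward_passes[0])
    return [sum(fp[a] for fp in forward_passes) / m
            for a in range(num_actions)]
```

In a full implementation the same list of passes would also yield a per-action variance as an uncertainty estimate, but only the mean is needed for the aggregated Q̄(s, a) above.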

4.7. Entropy-Aware Exploration

To prevent premature convergence to brittle deterministic policies, Enhanced DQN employs entropy-aware exploration during training by sampling actions from a softmax distribution over estimated Q-values. The temperature parameter controls exploration intensity. During evaluation, the policy acts greedily. The temperature parameter (τ = 0.5) was selected to balance exploration and exploitation, following common practice in entropy-regularized reinforcement learning. Lower values tend to produce near-deterministic policies, while higher values increase stochasticity. The chosen value was validated through preliminary experiments to ensure stable learning without excessive randomness. A full hyperparameter sweep was not conducted due to the high computational cost of long-horizon training. A typical softmax policy is
π(a | s) = exp(Q̄(s, a)/τ) / Σ_{a′} exp(Q̄(s, a′)/τ).
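A numerically stable sketch of this policy, using τ = 0.5 as stated in the text:

```python
import math

def softmax_policy(q_values, tau=0.5):
    """Boltzmann action distribution over (mean) Q-values; subtracting
    the max before exponentiating avoids floating-point overflow."""
    z = [q / tau for q in q_values]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]
```

As τ decreases the distribution sharpens toward the greedy action, which matches the deterministic behavior used at evaluation time.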

4.8. Integrated Stability Perspective

Enhanced DQN addresses several common sources of instability in deep value-based learning. Double DQN mitigates maximization bias, dueling heads improve representation efficiency, PER improves data efficiency, and multi-step targets strengthen long-range credit assignment. MC dropout regularizes the function approximator under uncertainty, while entropy-aware exploration promotes sustained exploration when operating contexts evolve over time (Algorithm 1).
Algorithm 1. Training Procedure of the Proposed Enhanced DQN
1: Initialize online Q-network Q_θ with random parameters θ.
2: Initialize target network Q_{θ⁻} with parameters θ⁻ ← θ.
3: Initialize prioritized replay buffer B.
4: For episode = 1 to N_episodes do
 4.1 Reset environment and observe initial state s_0.
 4.2 For t = 0 to T − 1 do
  (a) Select action a_t using entropy-aware exploration π(a | s_t).
  (b) Execute a_t, observe r_t and next state s_{t+1}.
  (c) Store transition (s_t, a_t, r_t, s_{t+1}) in B with priority.
  (d) Sample a minibatch from B according to priorities; compute IS weights.
  (e) Compute the n-step target using Double DQN target estimation and Q_{θ⁻}.
  (f) Update θ by minimizing the weighted TD loss; apply gradient clipping.
  (g) Periodically synchronize the target network: θ⁻ ← θ.
 4.3 End for
5: End for

5. Experimental Setup and Applied Case Study

This section describes a two-stage evaluation of the proposed Enhanced DQN. Stage 1 uses LunarLander-v3 as a diagnostic benchmark to assess learning dynamics, convergence behavior, and robustness across random seeds. Stage 2 evaluates the learned policy in a custom simulation of a residential split-type air conditioner operated hourly for 365 days (8760 steps). The diagnostic stage provides evidence of training stability before transferring the controller to the long-horizon HVAC control task, which constitutes the primary applied contribution of this work.

5.1. Two-Stage Evaluation Overview

The overall workflow of the proposed study is illustrated in Figure 1.

5.2. Stage 1 Diagnostic Benchmark on LunarLander-v3

To analyze the learning characteristics of the proposed method under controlled conditions, a benchmark evaluation is conducted on LunarLander-v3. This benchmark is intentionally used as a diagnostic tool rather than the target application. The objective is to examine learning dynamics, convergence behavior, and stability across stochastic training conditions before applying the method to the one-year split-type air conditioning control task.
Training protocol and reproducibility. Unless otherwise stated, each method is trained under the same episode budget and repeated across multiple random seeds. Learning curves are reported as the mean episodic return across seeds, with variability summarized as ±1 standard deviation. For readability, a moving-average smoothing is applied to the learning curves; the smoothing window is specified in the plotting script used to generate the result visualizations.
This benchmark is selected due to its well-established use in reinforcement learning research, providing a controlled yet stochastic environment to evaluate convergence behavior, stability, and policy learning dynamics. In particular, LunarLander-v3 provides a stochastic and continuous control environment with delayed rewards, making it suitable for evaluating stability and exploration behavior in value-based reinforcement learning algorithms.

5.3. Stage 1 Baselines Training Protocol and Hyperparameters

5.3.1. Comparative Baselines

Enhanced DQN is compared against three value-based baselines: DQN, Double DQN (DDQN), and Dueling DQN. The baselines provide a consistent reference spectrum for isolating the contribution of the proposed stability-oriented enhancements.

5.3.2. Seeds and Episode Budget

For the diagnostic benchmark, each algorithm is trained for 2000 episodes per seed and repeated across 5 random seeds (0,1,2,3,4) to reduce stochastic bias and avoid over-interpreting a single favorable run. All experiments in this study are conducted using five independent random seeds to ensure statistical robustness and reproducibility. Performance metrics are reported as mean ± standard deviation across these seeds unless otherwise specified. The applied annual HVAC simulation is substantially more expensive. In the revised annual analysis, all reward-weight configurations (RW1–RW4) are evaluated using five random seeds to improve statistical consistency and to support a more direct comparison of reward-weight sensitivity. The corresponding annual results are presented in the Results section, including a focused comparison of the strongest configurations and a full summary of the annual energy outcomes across RW1–RW4.

5.3.3. Hyperparameter Configuration

To ensure scientific rigor, baseline methods share a standardized set of hyperparameters, whereas Enhanced DQN activates its specialized components, such as prioritized experience replay, MC dropout, multi-step returns, and entropy-aware exploration. The consolidated settings are summarized in Table 1.
In addition, a limited preliminary tuning process was conducted using a reduced training horizon and a small set of candidate values to verify stability trends before finalizing the reported configuration. This approach was adopted to balance methodological rigor with computational feasibility.
Baselines share the same base configuration; DDQN enables double-target estimation and Dueling DQN enables dueling heads. Enhanced DQN integrates PER, n-step returns, MC Dropout, and entropy-aware exploration.
The hyperparameters of Enhanced DQN were selected based on prior reinforcement learning studies and preliminary experiments to ensure stable convergence and robust learning behavior. These settings were chosen to support methodological consistency rather than exhaustive optimization. Due to the computational cost of repeated long-horizon training, a full hyperparameter sweep was not conducted. Sensitivity to reward weights is explicitly analyzed through the RW1–RW4 configurations.
All experiments were conducted on a system equipped with an Intel Core i5-9400 CPU (2.90 GHz; Intel Corporation, Santa Clara, CA, USA), 16 GB RAM, and an NVIDIA GeForce GTX 1660 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA). The training time for the proposed model was approximately 24 h per seed.

5.4. Stage 2 Applied Case Study: Split-Type AC Simulation over 365 Days

This applied case study evaluates Enhanced DQN on a realistic long-horizon HVAC control task by simulating a residential split-type air conditioner at hourly resolution over 365 days (8760 steps) under time-varying outdoor conditions and PM10 exposure. The purpose is to assess whether the learned policy remains stable and interpretable over an annual decision horizon and to examine multi-objective trade-offs among energy consumption, maintenance-related degradation, and thermal comfort.

5.4.1. Environment Overview and Episode Definition

A custom Gymnasium-compatible environment (SplitTypeEnv) is developed for yearly HVAC simulation. Each episode proceeds sequentially through the rows of an input weather/PM dataset (one row per hour) and terminates after the final time step, corresponding to one full-year trajectory.
The environment provides an 8-dimensional observation vector including indoor and outdoor temperature, indoor and outdoor humidity, PM10 concentration, filter health, and temporal indicators such as hour of day and day of year.
The control action is represented as a MultiDiscrete tuple with 420 discrete combinations. In the current implementation, the temperature index, which is mapped to the setpoint, and the fan level directly affect the system dynamics, whereas the remaining dimensions are retained for extensibility. Figure 2 illustrates the overall structure of the SplitTypeEnv simulation environment, including the exogenous inputs, environment dynamics, reward computation, and interaction with the RL agent.

5.4.2. Exogenous Inputs: Outdoor Weather and PM10 (Data Sources and Fields)

The simulator is driven by exogenous inputs that represent outdoor weather conditions and particulate matter exposure (PM10). These variables are not controlled by the agent, but they strongly affect cooling demand and maintenance-related degradation. Outdoor meteorological observations are obtained from the Thai Meteorological Department (TMD), while PM10 concentrations are obtained from a DustBoy monitoring device. The two data sources are merged and aligned to an hourly timeline to drive the annual simulation.
For the applied evaluation, the environment consumes an hourly dataset with one record per hour over 365 days (8760 time-steps). The dataset includes the following fields.
  • day_of_year: day index in the annual cycle (1–365/366)
  • hour: hour-of-day (0–23)
  • outdoor_temp: outdoor air temperature (°C)
  • outdoor_humidity: outdoor relative humidity (%)
  • pm10: PM10 concentration (µg/m3)
Daily meteorological records from TMD are converted into hourly profiles using a diurnal distribution and interpolation scheme. This approach preserves seasonal trends from daily observations while introducing realistic within-day variability. DustBoy PM10 readings are time-stamped and aligned to the same hourly grid before being merged with the weather profiles.
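One simple way to realize such a diurnal distribution is a sinusoidal profile between the daily minimum and maximum. The functional form and the assumed peak hour (15:00) below are illustrative assumptions, since the paper does not specify its exact interpolation scheme:

```python
import math

def hourly_temperature_profile(t_min, t_max):
    """Spread a daily (min, max) temperature pair over 24 hourly values
    using a sinusoid that bottoms out at 03:00 and peaks at 15:00; both
    the peak hour and the sinusoidal form are illustrative assumptions."""
    mean = (t_min + t_max) / 2.0
    amp = (t_max - t_min) / 2.0
    return [mean + amp * math.sin(2.0 * math.pi * (h - 9) / 24.0)
            for h in range(24)]
```

Any such scheme preserves the daily extremes from the TMD records while adding plausible within-day variation, which is the stated goal of the interpolation step.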

5.4.3. Indoor Dynamics, Energy Proxy, and PM-Related Degradation

Indoor temperature and humidity are modeled using a simplified approximation designed to capture the essential thermal behavior required for reinforcement learning evaluation. Indoor temperature is estimated as a bounded response to outdoor conditions and cooling actions, remaining within a plausible operating range consistent with residential air-conditioning usage. Indoor humidity is similarly approximated from outdoor humidity with bounded adjustments to reflect typical comfort conditions. This simplified model enables efficient long-horizon simulation while preserving the key dynamics relevant to control decision-making. However, it does not represent a detailed physics-based building model and is not calibrated against real measured indoor data. Therefore, the model does not claim full physical realism. Instead, the environment is designed as a simplified simulation framework that provides a controlled and consistent setting for evaluating reinforcement learning behavior. The primary objective of this study is to assess relative performance improvements and policy behavior under consistent environmental conditions, rather than to reproduce exact real-world energy consumption values.
This approach is consistent with common practice in reinforcement learning-based HVAC studies, where simplified simulation environments are used to evaluate learning stability and control effectiveness under consistent and controlled dynamics. Future work will focus on validating the simulation model against real-world measurements and integrating more detailed building physics models.

5.4.4. Multi-Objective Reward via Scalarization

The applied case study uses a scalarized reward that combines three penalty components, namely energy cost, maintenance penalty, and comfort penalty, through weighted summation. This scalarization allows controlled manipulation of trade-offs by adjusting reward weights while retaining a single optimization objective compatible with DQN-family value learning.

5.4.5. Reward Weight Configurations RW1 to RW4

To demonstrate robustness without exhaustive tuning, a compact set of reward-weight configurations (RW1–RW4) is evaluated. These configurations represent a structured exploration of trade-offs among energy efficiency, maintenance-related degradation, and thermal comfort, rather than exhaustive optimization. A full hyperparameter sweep over reward weights was not conducted due to the high computational cost of long-horizon simulation. Table 2 summarizes the configurations and their design intent.

5.4.6. Experimental Design Matrix and Seeds

Each reward-weight configuration (RW1–RW4) was trained under the same environment settings and identical training hyperparameters, with only the reward weights varied across configurations. This design isolates the effect of reward preference from other algorithmic factors. To ensure consistency with Section 6, the evaluation settings are summarized as follows. First, benchmark robustness was assessed using five random seeds, and the results are reported as mean ± standard deviation. Second, the annual HVAC analysis is reported using five seeds for all reward-weight configurations, with a focused comparison of the strongest settings and a full summary of the annual results. Third, PM10-related behavioral adaptation is illustrated using RW4, as this configuration achieved the lowest mean annual energy consumption. Finally, reward-weight sensitivity is evaluated by comparing the annual energy consumption of RW1–RW4 in Section 6.4 and their corresponding monthly energy profiles.
The overall experimental design, including the benchmark setting, annual HVAC analysis, PM10-related behavioral evaluation, and reward-weight sensitivity analysis, is summarized in Table 3.

5.4.7. Evaluation Protocol Baseline vs. Learned Policy

The yearly evaluation compares the learned policy against a conventional fixed-setpoint baseline (25 °C). While more advanced control strategies such as model predictive control (MPC) and RL-based controllers exist in the literature, the fixed setpoint baseline is selected as a representative and widely used reference in residential applications. This allows a clear and interpretable comparison of energy savings and operational behavior under identical environmental conditions. The baseline applies a constant action mapping across all time steps, while the learned policy executes the trained model deterministically over the full-year trajectory.
Energy saving is computed as
Saving(%) = (E_base − E_AI) / E_base × 100.
In addition to annual energy, per-step logs enable trajectory-level analysis (e.g., daily aggregation and monthly aggregation) used in Section 6. Therefore, the results should be interpreted as an application-oriented comparison against a practical residential reference, rather than as a definitive benchmark against all advanced control strategies.
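Applied to the annual figures reported in Section 6.3, the saving metric works out as follows (the kWh values are the paper's own reported means for the baseline and RW4):

```python
def saving_percent(e_base, e_ai):
    """Percentage energy saving of the learned policy relative to the
    fixed-setpoint baseline: (E_base - E_AI) / E_base * 100."""
    return (e_base - e_ai) / e_base * 100.0

# Baseline (5116.22 kWh) vs. RW4 mean annual consumption (4440.03 kWh)
rw4_saving = saving_percent(5116.22, 4440.03)
```

This reproduces the 13.22% annual saving reported for RW4.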

5.4.8. Reporting Metrics Aligned with Section 6

To ensure consistency between the experimental setup and the results presented in Section 6, the evaluation metrics are specified as follows. In the LunarLander-v3 benchmark, performance is evaluated by the mean episodic return across random seeds with a ±1 standard deviation envelope and smoothing applied for visual clarity, as well as by the reward distribution over the final 100 episodes across seeds. In the applied 365-day air-conditioning case study, the reported metrics include annual energy consumption of the baseline strategy together with the two strongest reward-weight configurations, RW3 and RW4, and their corresponding energy-saving performance, supported by a full annual energy summary across RW1–RW4. In addition, daily energy consumption in relation to PM10 concentration is reported for RW4 to examine behavioral adaptation under the lowest-mean-energy configuration. Finally, the monthly energy profiles of RW1–RW4 are assessed relative to the baseline, together with the annual energy summary across reward-weight sensitivity settings.

6. Results

This section reports the experimental results of the proposed Enhanced Deep Q-Network (Enhanced DQN). The evaluation comprises two parts. First, LunarLander-v3 is used as a diagnostic benchmark to examine learning dynamics, convergence behavior, and stability under stochastic training conditions. Second, the main application results for split-type air conditioner energy optimization are presented, which constitute the primary contribution of this study.

6.1. Diagnostic Benchmark Evaluation on LunarLander-v3

To analyze the learning characteristics of the proposed method under controlled conditions, a comparative benchmark evaluation was conducted on the LunarLander-v3 environment. This benchmark is intentionally used as a diagnostic tool rather than the target application. The goal is to examine learning dynamics, convergence behavior, and stability under stochastic training conditions, before transferring the method to the one-year split-type air conditioning control task. Unless otherwise stated, each method was trained under the same episode budget and repeated across multiple random seeds; the learning curves report the mean episodic return across seeds with variability summarized as ±1 standard deviation. For readability, a moving-average smoothing is applied to the learning curves.
Classical variants commonly exhibit faster early-stage improvement in this benchmark environment, whereas the Enhanced DQN improves more gradually and maintains a larger variability band throughout training (see Figure 3). The horizontal reference line at 200 indicates the commonly used “solved” threshold for LunarLander-v3 and is included as a benchmark reference rather than a primary optimization target. From a diagnostic perspective, the observed learning profile suggests that the Enhanced DQN emphasizes sustained exploration and robustness, which is desirable for non-stationary control settings where operating conditions and disturbances evolve over time.
Beyond mean learning curves, distributional analysis is useful to assess stability and tail-risk behaviors that may be obscured by averages. Therefore, steady-state reward variability is summarized using the final episodes of each training run, where performance is expected to reflect late-training behavior after substantial learning has occurred.
Figure 4 presents the reward distribution over the final 100 training episodes across five random seeds for the benchmark algorithms.
Baseline variants typically show more concentrated distributions, while the Enhanced DQN can exhibit a broader distribution and occasional low-reward episodes. Importantly, this pattern does not imply that the proposed method is designed for benchmark-specific return maximization. Instead, the retained dispersion and occasional tail events indicate that the proposed method preserves exploratory capacity even in late training stages. Such characteristics are particularly relevant for real-world, noisy control problems, which are not fully represented by simplified benchmark tasks such as LunarLander-v3.
Overall, the benchmark results serve as diagnostic evidence that the proposed Enhanced DQN exhibits stable long-horizon learning behavior under stochastic training, motivating its subsequent evaluation in the target split-type air conditioning control scenario presented in the following sections. This diagnostic evaluation is not intended to demonstrate benchmark superiority, but rather to assess learning stability under stochastic conditions prior to deployment in the target application.

6.2. Annual Energy Optimization Performance on Split-Type Air Conditioning (Main Result)

The baseline controller, operating at a fixed setpoint of 25 °C, consumed 5116.22 kWh over the simulated annual period. Across the evaluated reward-weight configurations, the proposed Enhanced DQN consistently reduced annual energy consumption, with the strongest performance observed under RW4, followed closely by RW3. Because the baseline controller is deterministic, it exhibits no variability across seeds, whereas the RL-based controller shows only modest cross-seed variation, which is explicitly reported to ensure transparency and reproducibility.
The relatively low standard deviation indicates stable policy performance across different random initializations, confirming the robustness of the learned control strategy in long-horizon operation.

6.3. Annual Energy Consumption and Savings

Figure 5 compares the annual energy consumption of the fixed 25 °C baseline with that of the two strongest reward-weight configurations of the proposed Enhanced DQN, namely RW3 and RW4, reported as mean ± standard deviation across five random seeds (n = 5). The baseline consumed 5116.22 kWh over the simulated year, whereas Enhanced DQN under RW3 and RW4 consumed 4440.69 ± 27.90 kWh and 4440.03 ± 37.50 kWh, respectively. These values correspond to absolute reductions of 675.54 ± 27.90 kWh for RW3 and 676.20 ± 37.50 kWh for RW4.
In percentage terms, the observed reductions correspond to annual energy savings of 13.20% for RW3 and 13.22% for RW4 relative to the deterministic baseline. The difference between RW3 and RW4 is therefore very small in practical terms, indicating that both reward-weight configurations achieve similarly strong long-horizon energy-saving performance. At the same time, RW3 exhibits slightly lower cross-seed variability than RW4, suggesting marginally greater consistency across random initializations, whereas RW4 achieves the lowest mean annual energy consumption among the evaluated configurations.
Table 4 summarizes the annual energy consumption, energy reduction, and energy-saving percentage for all evaluated reward-weight configurations relative to the fixed 25 °C baseline. RW1, RW2, RW3, and RW4 achieved mean annual consumptions of 4521.20 ± 50.10, 4456.61 ± 20.16, 4440.69 ± 27.90, and 4440.03 ± 37.50 kWh, corresponding to savings of 11.63%, 12.90%, 13.20%, and 13.22%, respectively. These results indicate that the proposed framework consistently improves annual energy performance relative to the fixed-setpoint baseline across multiple reward-weight settings.
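The percentage savings in Table 4 follow directly from the mean annual consumptions; a minimal check using the table's values is:

```python
BASELINE_KWH = 5116.22  # fixed 25 °C setpoint controller, annual consumption

# Mean annual consumption per reward-weight configuration (from Table 4).
annual_kwh = {"RW1": 4521.20, "RW2": 4456.61, "RW3": 4440.69, "RW4": 4440.03}

def saving_percent(consumption, baseline=BASELINE_KWH):
    """Annual energy saving relative to the deterministic baseline, in %."""
    return 100.0 * (baseline - consumption) / baseline

savings = {name: round(saving_percent(kwh), 1) for name, kwh in annual_kwh.items()}
# savings -> {'RW1': 11.6, 'RW2': 12.9, 'RW3': 13.2, 'RW4': 13.2}
```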
Overall, the findings confirm that the proposed Enhanced DQN can deliver substantial and repeatable annual energy savings in a realistic residential control scenario without relying on hand-crafted schedules or fixed heuristics. More importantly, the close performance of RW3 and RW4 suggests that the framework is not narrowly dependent on a single reward-weight choice but rather remains effective across nearby multi-objective preference settings.
Together, these results indicate that the proposed framework is robust across reward-weight configurations and that strong energy-saving performance does not depend on a single sharply dominant setting.

6.4. Seasonal and Environmental Adaptation with PM10 Awareness

To further investigate the behavioral characteristics of the learned policy under environmental variability, the relationship between daily energy consumption and particulate matter concentration (PM10) was analyzed over the full 365-day simulation period. In this section, the analysis is presented for the RW4 configuration, which achieved the lowest mean annual energy consumption among the evaluated reward-weight settings.
For the RW4 configuration, Figure 6 shows the 7-day moving-average profiles of daily energy consumption and PM10 concentration over the annual simulation horizon.
Figure 6 is intended to support behavioral interpretation rather than to establish a strict causal relationship between PM10 concentration and energy consumption. The purpose is to examine whether the learned control policy exhibits environmentally coherent behavior when air-quality conditions fluctuate over time. The energy trajectory under RW4 remains generally smooth over the full annual horizon, with no evidence of persistent oscillatory instability or prolonged uncontrolled spikes. Although short-term increases in energy use are observed during periods of elevated seasonal demand, the controller does not exhibit sustained excessive energy escalation during later periods of renewed PM10 increase. This suggests that the learned policy remains responsive to environmental conditions while avoiding unnecessarily aggressive operation.
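The 7-day moving-average profiles in Figure 6 can be produced with a standard rolling mean; the synthetic daily series below are stand-ins for the study's actual energy and PM10 records, not the data themselves.

```python
import numpy as np
import pandas as pd

# Synthetic one-year daily series (illustrative stand-ins, not the study data).
days = pd.date_range("2025-01-01", periods=365, freq="D")
t = np.arange(365)
rng = np.random.default_rng(2)
energy_kwh = 12.0 + 3.0 * np.sin(2 * np.pi * t / 365) + rng.normal(0.0, 0.5, 365)
pm10 = np.clip(45.0 + 25.0 * np.cos(2 * np.pi * t / 365)
               + rng.normal(0.0, 5.0, 365), 0.0, None)

df = pd.DataFrame({"energy_kwh": energy_kwh, "pm10": pm10}, index=days)
smoothed = df.rolling(window=7).mean()  # 7-day moving average of both series
```

Overlaying the two columns of `smoothed` on a shared time axis yields the kind of dual-profile view used for the behavioral interpretation above.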
Because the current environment uses a simplified PM10-driven degradation proxy rather than a detailed physical fouling model, the observed relationship should be interpreted as indirect evidence of maintenance-aware behavior rather than explicit physical validation. In this context, the relatively stable daily energy trajectory indicates that the controller is able to maintain adaptive long-horizon behavior under changing environmental conditions without sacrificing operational stability. These findings complement the annual energy results in Section 6.2 by showing that the proposed Enhanced DQN does not merely reduce aggregate annual energy use, but also exhibits environmentally adaptive and operationally coherent behavior over time.
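The exact form of the simplified PM10-driven degradation proxy is not reproduced here; one minimal shape such a proxy could take (entirely a hypothetical illustration, with all parameter values assumed) is a bounded wear level that accumulates faster when PM10 exceeds a threshold:

```python
def degradation_step(level, pm10, base_rate=1e-4, threshold=50.0):
    """Advance an abstract wear level in [0, 1] by one day.

    Hypothetical proxy: wear accrues at `base_rate` per day, scaled up in
    proportion to how far PM10 exceeds `threshold` (all values illustrative).
    """
    excess = max(pm10 - threshold, 0.0)
    rate = base_rate * (1.0 + excess / threshold)
    return min(level + rate, 1.0)

# A polluted day degrades the proxy faster than a clean one.
clean = degradation_step(0.0, pm10=20.0)
dirty = degradation_step(0.0, pm10=120.0)
```

Feeding such a wear level (or its daily increment) into the reward as a penalty is one way a controller can acquire the maintenance-aware behavior discussed above.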

6.5. Reward Weight Sensitivity and Robustness Analysis

To evaluate the robustness of the proposed framework with respect to reward-weight variations, annual and monthly energy consumption was compared across four reward-weight configurations (RW1–RW4). All configurations were trained and evaluated under identical environment settings and hyperparameters, with only the reward weights varied, allowing the effect of objective preference to be isolated directly. As summarized in Table 4, the fixed-setpoint baseline consumed 5116.22 kWh annually, whereas the Enhanced DQN achieved the following mean annual energy consumptions across five random seeds: 4521.20 ± 50.10 kWh for RW1, 4456.61 ± 20.16 kWh for RW2, 4440.69 ± 27.90 kWh for RW3, and 4440.03 ± 37.50 kWh for RW4. These correspond to annual energy savings of 11.63%, 12.90%, 13.20%, and 13.22%, respectively. While the current study evaluates a structured set of four configurations, the observed performance trends suggest that the policy behavior varies smoothly with respect to reward-weight changes rather than exhibiting abrupt instability. This indicates that the proposed framework is not highly sensitive to small perturbations in reward design within a reasonable parameter range.
These results show that all reward-weight configurations substantially improve annual energy performance relative to the baseline. The maximum deviation among RW1–RW4 is 88.75 kWh, which corresponds to only 1.73% of baseline annual consumption, indicating limited sensitivity to reward-weight tuning at the system level. Moreover, the reported standard deviations remain below 2% of annual consumption for all configurations, confirming stable behavior across random seeds. Among the tested settings, RW4 achieved the lowest mean annual energy consumption, although the difference between RW3 and RW4 is very small in practical terms. At the same time, RW3 exhibited slightly lower cross-seed variability than RW4, suggesting marginally greater consistency across random initializations. Taken together, these findings indicate that the proposed framework is robust rather than narrowly dependent on a single sharply tuned reward design.
Figure 7 compares the monthly energy consumption profiles of RW1–RW4 with the fixed 25 °C baseline over the annual simulation cycle.
Figure 7 further supports this conclusion by showing that all reward-weight configurations consistently reduce monthly energy consumption relative to the fixed 25 °C baseline across the full year. Importantly, the seasonal demand profile is preserved under all configurations, with higher consumption during peak cooling months and lower demand during milder periods. This indicates that the learned policies do not distort the natural cooling-demand structure of the environment. RW1 shows the highest monthly energy use among the RL configurations, especially during higher-demand periods, whereas RW2, RW3, and RW4 remain closely clustered and consistently below the baseline. RW3 and RW4 provide the strongest overall performance, with only minor month-to-month differences between them, further indicating that the observed annual gains are not the result of narrow reward over-specialization.
Overall, the reward-weight sensitivity analysis shows that the proposed Enhanced DQN maintains stable long-horizon control performance across multiple objective preferences. Reward-weight variation primarily influences the relative emphasis of the learned behavior rather than causing policy collapse, oscillatory instability, or seasonal divergence. This practical tunability is important for real-world deployment, where different users or operating contexts may prioritize energy efficiency, comfort, and maintenance-related considerations differently, while still requiring a robust and reliable control policy.

7. Discussion

This study investigated the applicability of a deep reinforcement learning-based control framework for improving the energy efficiency and operational stability of residential split-type air conditioning systems. The proposed Enhanced Deep Q-Network (Enhanced DQN) was designed to address the long-horizon and non-stationary characteristics of residential cooling environments, including seasonal weather variability and particulate pollution. Rather than focusing on benchmark optimization alone, the approach emphasizes robustness and adaptability in real operational contexts, which is increasingly important in smart building energy management and intelligent HVAC control systems.

7.1. Algorithmic Dynamics and Diagnostic Insights

The diagnostic benchmark experiment conducted on LunarLander-v3 provides useful insight into the learning behavior of the proposed framework. In comparison with conventional value-based algorithms, including DQN and Dueling DQN, the Enhanced DQN exhibited a more gradual convergence pattern and maintained a wider reward distribution throughout training. This behavior is consistent with reinforcement learning theory, in which exploration-preserving mechanisms may reduce the rate of early convergence but improve robustness under non-stationary conditions [14,16]. Several architectural components contribute to this characteristic. Double Q-learning reduces overestimation bias in action-value estimation [17], while the dueling architecture improves representation learning by separating state-value and advantage estimation [18]. In addition, prioritized experience replay emphasizes informative transitions and accelerates learning from high-temporal-difference-error samples [19]. When combined with entropy-aware exploration, these mechanisms promote continued policy adaptation rather than premature convergence to a potentially brittle deterministic policy. Such properties are particularly important in building control applications, where environmental disturbances and operating conditions may vary over time.
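Two of these mechanisms can be stated compactly: the dueling architecture recombines its streams as Q(s, a) = V(s) + A(s, a) - mean_a A(s, a), and the Double DQN target lets the online network select the next action while the target network evaluates it. A numpy sketch (shapes and values illustrative) follows:

```python
import numpy as np

def dueling_q(value, advantage):
    """Dueling recombination: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)."""
    return value + advantage - advantage.mean(axis=-1, keepdims=True)

def double_dqn_target(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: online net picks the argmax action, target net evaluates it."""
    best = q_online_next.argmax(axis=-1)                       # action selection
    q_eval = np.take_along_axis(q_target_next, best[:, None],  # action evaluation
                                axis=-1).squeeze(-1)
    return rewards + gamma * (1.0 - dones) * q_eval

# Batch of 2 transitions with 2 actions (illustrative numbers).
q = dueling_q(np.array([[1.0], [2.0]]), np.array([[1.0, 3.0], [0.0, 2.0]]))
y = double_dqn_target(np.array([1.0, 1.0]), np.array([0.0, 1.0]),
                      q_online_next=np.array([[1.0, 3.0], [2.0, 0.0]]),
                      q_target_next=np.array([[5.0, 7.0], [6.0, 4.0]]))
```

Decoupling selection from evaluation is what breaks the max-operator feedback loop responsible for overestimation bias in vanilla Q-learning.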

7.2. Application-Level Energy Performance and SDG 7 Relevance

The principal contribution of this research lies in the application-level evaluation using a full-year simulation of a residential air-conditioning system. Across the evaluated reward-weight configurations, the proposed Enhanced DQN consistently reduced annual electricity consumption relative to the conventional fixed 25 °C baseline. The strongest results were obtained under RW3 and RW4, which achieved annual energy savings of 13.20% and 13.22%, respectively. Although RW4 produced the lowest mean annual energy consumption, the difference between RW3 and RW4 was very small, while RW3 exhibited slightly lower cross-seed variability. These results demonstrate that reinforcement learning can identify control policies that adapt to seasonal cooling demand while maintaining stable long-horizon performance.
These findings align with broader research on reinforcement learning for building energy management, where data-driven control strategies have been shown to outperform rule-based or schedule-based methods in HVAC optimization [22,23]. In tropical climates such as Thailand, where residential cooling demand constitutes a substantial share of household electricity consumption, improvements in air-conditioner operation can have meaningful impacts on national energy consumption and emissions [4]. Consequently, intelligent control strategies such as the proposed approach contribute to broader sustainability objectives, particularly those associated with Sustainable Development Goal 7 (Affordable and Clean Energy).

7.3. Maintenance-Aware Behavior Under PM10 Conditions

Beyond energy efficiency, the analysis also revealed environmentally adaptive behavior in relation to particulate matter exposure. During periods of elevated PM10 concentration, the RL-controlled system did not exhibit abrupt increases in energy demand or unstable operational patterns. Instead, the policy maintained relatively stable energy consumption while responding to environmental variability. Although the present simulation did not explicitly model filter degradation or component wear, this pattern suggests a form of implicit maintenance-aware control. Previous studies suggest that particulate accumulation on air-conditioning filters and heat exchanger surfaces can degrade system efficiency, reduce heat-transfer performance, and increase maintenance requirements over time [9,10]. By avoiding excessive high-load operation during periods of poor air quality, the learned policy may indirectly reduce maintenance burden and extend equipment life. The integration of operational control with maintenance considerations is increasingly discussed in the literature on predictive and prescriptive maintenance systems [31,32]. However, most existing frameworks treat maintenance planning and control optimization as separate layers. The present study contributes to bridging this gap by demonstrating that reinforcement learning-based control policies can implicitly incorporate maintenance-relevant signals into operational decisions.

7.4. Reward-Weight Sensitivity, Policy Tunability, and SDG 13 Relevance

Another important finding concerns the robustness of the control framework to reward-weight variations. The reward-sensitivity experiment showed that annual energy savings remained relatively consistent across multiple reward configurations (RW1–RW4). Among the evaluated settings, RW4 produced the lowest mean annual energy consumption, while RW3 showed slightly lower variability across seeds. However, the difference between RW3 and RW4 remained very small, indicating that the system is not overly sensitive to specific reward-weight choices. Reward shaping plays a crucial role in reinforcement learning applications because it determines how competing objectives are balanced [14]. In building energy management, these objectives typically include energy consumption, thermal comfort, equipment stability, and economic cost [22,24]. The observed stability across reward configurations suggests that the proposed architecture learns a generalizable policy structure rather than exploiting narrow reward incentives.
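The RW1–RW4 configurations can be understood as different weight vectors in a scalarized multi-objective reward. The study's exact terms are not reproduced here; the sketch below (term names, signs, and weight values are assumptions) only illustrates the scalarization:

```python
def scalarized_reward(energy_kwh, comfort_dev_c, wear_delta, weights):
    """Hypothetical scalarization of competing objectives into one reward.

    energy_kwh: energy used this step; comfort_dev_c: deviation from the
    comfort band in deg C; wear_delta: increment of a degradation proxy.
    All three act as penalties, so the reward is their negated weighted sum.
    """
    w_energy, w_comfort, w_maint = weights
    return -(w_energy * energy_kwh
             + w_comfort * abs(comfort_dev_c)
             + w_maint * wear_delta)

# Two hypothetical preference settings: energy-heavy vs comfort-heavy.
r_energy_heavy = scalarized_reward(1.2, 0.5, 0.01, weights=(1.0, 0.2, 0.5))
r_comfort_heavy = scalarized_reward(1.2, 0.5, 0.01, weights=(0.2, 1.0, 0.5))
```

Under this view, reward-weight robustness means the learned policy changes gradually as the weight vector is perturbed, which is consistent with the smooth RW1–RW4 trends reported above.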
From a sustainability perspective, this tunability is also relevant to SDG 13, which emphasizes climate action through adaptive and efficient resource use. A control framework that maintains stable energy-saving performance under different objective preferences is better suited to practical deployment in diverse environmental and operational conditions. This characteristic increases the potential of the proposed method as a flexible decision-support mechanism for intelligent and sustainable air-conditioning control.

7.5. Practical Deployment Challenges and Limitations

Despite the encouraging results, several practical issues remain for real-world deployment. First, implementation in residential environments would require additional IoT sensing infrastructure, including PM, temperature, and humidity sensors, together with a microcontroller and communication interface for continuous monitoring. A practical sensing node can be implemented using commercially available components such as a PMS7003 particulate matter sensor, an SHT30/SHT31 temperature-humidity sensor, and an ESP32-based controller. Based on currently available component prices in Thailand, such a sensing node would cost approximately 37–61 USD per installation point, excluding installation, calibration, and long-term maintenance costs. PMS7003 modules are commonly sold in the range of about 25–37 USD, SHT31 sensors around 4–6 USD, and ESP32 development boards around 1–7 USD, depending on vendor and board type.
Second, the present framework evaluates comfort mainly through environmental variables and a temperature-based reward formulation, without incorporating direct occupant comfort feedback. In practice, perceived comfort may differ across users and may also be influenced by activity level, clothing, occupancy patterns, and personal preference. As a result, the current formulation should be interpreted as a simplified comfort-oriented control model rather than a fully personalized comfort management system.
Third, the study relies on a simulation-based environment representing a residential air-conditioning system. Although the simulation incorporates realistic weather and PM10 data, it does not fully capture all physical dynamics of real buildings, such as occupant behavior, internal heat gains, and detailed thermal inertia. In particular, indoor thermal response is approximated using a simplified model driven by outdoor conditions and cooling actions, rather than a full physics-based building simulation. In addition, although the proposed Enhanced DQN demonstrates stable performance in offline training, practical real-time deployment would require reliable sensing, stable communication, robust data preprocessing, and sufficiently efficient inference on embedded or edge computing hardware. These requirements introduce additional implementation complexity beyond the simulation setting considered in this study.
In addition, the current framework does not incorporate dynamic or time-varying electricity pricing. The reward structure is therefore based on static energy-related penalties rather than real tariff fluctuations, which limits the economic realism of the present formulation.

7.6. Future Work Direction

Future work will focus on extending the framework toward real-world validation, including hardware-in-the-loop experiments and deployment in actual residential environments. In addition, further investigation of multi-objective optimization with dynamic electricity pricing and user feedback integration will enhance the practical applicability of the proposed approach.
In summary, this study demonstrates that reinforcement learning-based control can effectively address the challenges of long-horizon, non-stationary residential cooling systems. The proposed Enhanced DQN achieves significant energy savings while maintaining stable and adaptive behavior under varying environmental conditions. Importantly, the framework exhibits robustness across reward configurations without policy instability, indicating strong generalizability. These findings highlight the potential of reinforcement learning as a practical and scalable solution for intelligent energy management in residential buildings, contributing to sustainable and data-driven HVAC control systems.

8. Conclusions

This study developed an Enhanced Deep Q-Network (Enhanced DQN) framework for long-horizon energy optimization and maintenance-aware control of residential split-type air conditioners. The proposed approach integrates multiple stability- and exploration-oriented components, including Double Q-learning, a dueling architecture, prioritized experience replay, multi-step returns, Bayesian regularization, and entropy-guided action selection, to improve robustness under non-stationary operating conditions and extended control horizons. Rather than focusing on benchmark dominance, LunarLander-v3 was used as a diagnostic environment to examine learning dynamics, cross-seed variability, and exploration behavior. The results indicate that the proposed framework maintains sustained exploration and adaptive learning behavior under stochastic conditions. These results suggest that the proposed framework is computationally feasible for offline training, although further optimization is required for real-time deployment in embedded or edge environments.
At the application level, evaluation in a 365-day hourly simulation demonstrated that the Enhanced DQN consistently reduced annual electricity consumption relative to the fixed 25 °C baseline. Across the evaluated reward-weight configurations, the strongest performance was obtained under RW3 and RW4, which achieved annual energy savings of 13.20% and 13.22%, respectively. RW4 yielded the lowest mean annual energy consumption, whereas RW3 showed slightly lower cross-seed variability. These results confirm that the learned policy can effectively adapt to seasonal demand variations while maintaining stable long-horizon performance.
Seasonal and environmental analyses further show that the learned policy avoids excessive energy spikes under high PM10 conditions, indicating environmentally adaptive and operationally stable behavior. Although maintenance effects are inferred indirectly through environmental signals rather than explicitly modeled, the observed behavior is consistent with maintenance-aware operation that may reduce mechanical stress over time.
Reward-weight sensitivity analysis further demonstrated controlled tunability. Across configurations (RW1–RW4), the maximum deviation in annual energy consumption remained small relative to baseline demand, indicating relatively low sensitivity to reward-weight variation under the evaluated conditions. The close performance of RW3 and RW4 suggests that the proposed framework does not rely on a narrowly optimized reward setting but can maintain strong energy-saving performance across nearby multi-objective preference structures.
Despite these promising results, several limitations should be noted. The evaluation is based on a simulation environment and does not fully capture real-world complexities such as occupant behavior, internal heat gains, and detailed physical system dynamics. In addition, although the main reported experiments were repeated across five random seeds to improve statistical robustness and reproducibility, maintenance effects are represented only through a simplified PM10-driven proxy rather than explicit degradation dynamics. Furthermore, real-world deployment would require integration with IoT sensing infrastructure and an efficient real-time control implementation.
Future work will focus on real-world validation, including hardware-in-the-loop testing and deployment in residential environments. Extensions of the framework to include explicit degradation modeling, adaptive reward-weight scheduling, dynamic electricity pricing, and multi-building or multi-agent coordination will further enhance its applicability and scalability. Overall, the findings demonstrate that reinforcement learning-based control offers a promising and practical solution for intelligent, adaptive, and energy-efficient HVAC operation in residential buildings, and these results highlight the practical potential of reinforcement learning as a scalable and adaptive control paradigm for next-generation residential energy systems.

Author Contributions

Conceptualization, N.K. and P.K.; Methodology, N.K. and P.K.; Software, N.K. and P.K.; Investigation, N.K., P.K. and O.T.; Writing—original draft, N.K. and P.K.; Writing—review and editing, O.T.; Supervision, P.K. and O.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author or https://github.com/natdanai-ki/EnhancedDQN, accessed on 4 March 2026.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. IEA. World Energy Outlook 2025; IEA: Paris, France, 2025. [Google Scholar]
  2. González-Torres, M.; Pérez-Lombard, L.; Coronel, J.F.; Maestre, I.R.; Yan, D. A review on buildings energy information: Trends, end-uses, fuels and drivers. Energy Rep. 2022, 8, 626–637. [Google Scholar] [CrossRef]
  3. IEA. Staying Cool Without Overheating the Energy System. Available online: https://www.iea.org/commentaries/staying-cool-without-overheating-the-energy-system (accessed on 10 October 2025).
  4. Poolsawat, K.; Tachajapong, W.; Prasitwattanaseree, S.; Wongsapai, W. Electricity consumption characteristics in Thailand residential sector and its saving potential. Energy Rep. 2020, 6, 337–343. [Google Scholar] [CrossRef]
  5. Energy Policy and Planning Office (EPPO). Thailand Energy Statistics Report 2025; Energy Policy and Planning Office (EPPO): Bangkok, Thailand, 2025; ISBN 978-616-8040-55-3.
  6. Energy Policy and Planning Office (EPPO). Energy Forecast 2023; Energy Policy and Planning Office (EPPO): Bangkok, Thailand, 2024.
  7. Weiland, S.; Hickmann, T.; Lederer, M.; Schwindenhammer, S. The 2030 agenda for sustainable development: Transformative change through the sustainable development goals? Politics Gov. 2021, 9, 90–95. [Google Scholar] [CrossRef]
  8. South, D.W.; Alpay, S. Analyzing global progress on United Nations sustainable development goals 7 and 13—Clean energy and climate action. Clim. Energy 2025, 41, 23–32. [Google Scholar] [CrossRef]
  9. Rossy, M.; Mushfiq, D.; Huynh, P. Effect of Filtration and Self-Cleaning Filter in Air Conditioning. In Proceedings of the 24th Australasian Fluid Mechanics Conference—AFMC2024, Canberra, Australia, 1–5 December 2024. [Google Scholar]
  10. Kumar, A.; Tomar, T.; Viral, R.; Asija, D. State of the art on automatic cleaning system for split air conditioners. In Proceedings of the International Conference on Emerging Technologies in Engineering and Science (ICETES-2023), Kanchikacherla, India, 11–12 August 2023; pp. 204–228. [Google Scholar]
  11. Fadhila, A. Analysis of Reliability Centered Maintenance of Air Conditioning Facilities. J. Ind. Eng. Educ. 2024, 2, 14–30. [Google Scholar]
  12. Ardyanny Mukty, Y.; Shimazaki, Y.; Tajima, M. Comparative Analysis between Field Study and ASHRAE-55 Adaptive Model of Thermal Comfort Perception in a Residential Building in Jakarta, Indonesia. In Proceedings of the 5th Asia Conference of the International Building Performance Simulation Association (ASim 2024), Osaka, Japan, 8–10 December 2024; pp. 539–546. [Google Scholar]
  13. Mustapha, T.D.; Hassan, A.S.; Khozaei, F.; Onubi, H.O. Examining thermal comfort levels and ASHRAE Standard-55 applicability: A case study of free-running classrooms in Abuja, Nigeria. Indoor Built Environ. 2024, 33, 8–22. [Google Scholar] [CrossRef]
  14. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  15. Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar]
  16. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  17. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2016. [Google Scholar]
  18. Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York, NY, USA, 19–24 June 2016; pp. 1995–2003. [Google Scholar]
  19. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. arXiv 2015, arXiv:1511.05952. [Google Scholar]
  20. Haarnoja, T.; Tang, H.; Abbeel, P.; Levine, S. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, Australia, 6–11 August 2017; pp. 1352–1361. [Google Scholar]
  21. Hou, Z.; Zhang, K.; Wan, Y.; Li, D.; Fu, C.; Yu, H. Off-policy maximum entropy reinforcement learning: Soft actor-critic with advantage weighted mixture policy (sac-awmp). arXiv 2020, arXiv:2002.02829. [Google Scholar]
  22. Yu, L.; Qin, S.; Zhang, M.; Shen, C.; Jiang, T.; Guan, X. A review of deep reinforcement learning for smart building energy management. IEEE Internet Things J. 2021, 8, 12046–12063. [Google Scholar] [CrossRef]
  23. Fu, Q.; Han, Z.; Chen, J.; Lu, Y.; Wu, H.; Wang, Y. Applications of reinforcement learning for building energy efficiency control: A review. J. Build. Eng. 2022, 50, 104165.
  24. Wang, X.; Wang, P.; Huang, R.; Zhu, X.; Arroyo, J.; Li, N. Safe deep reinforcement learning for building energy management. Appl. Energy 2025, 377, 124328.
  25. Alam, S.M.; Ali, M.H. An Overview of State-of-the-Art Research on Smart Building Systems. Electronics 2025, 14, 2602.
  26. Ghofrani, A.; Nazemi, S.D.; Jafari, M.A. HVAC load synchronization in smart building communities. Sustain. Cities Soc. 2019, 51, 101741.
  27. Dong, B.; Prakash, V.; Feng, F.; O’Neill, Z. A review of smart building sensing system for better indoor environment control. Energy Build. 2019, 199, 29–46.
  28. Masucci, D.; Foglietta, C.; Panzieri, S.; Pizzuti, S. Enhancing the smart building supervisory system effectiveness. Intell. Build. Int. 2022, 14, 564–580.
  29. Geraldo Filho, P.; Villas, L.A.; Gonçalves, V.P.; Pessin, G.; Loureiro, A.A.; Ueyama, J. Energy-efficient smart home systems: Infrastructure and decision-making process. Internet Things 2019, 5, 153–167.
  30. Manzini, R.; Regattieri, A.; Pham, H.; Ferrari, E. Introduction to Maintenance in Production Systems. Maint. Ind. Syst. 2010, 65–85.
  31. Zonta, T.; Da Costa, C.A.; da Rosa Righi, R.; de Lima, M.J.; da Trindade, E.S.; Li, G.P. Predictive maintenance in the Industry 4.0: A systematic literature review. Comput. Ind. Eng. 2020, 150, 106889.
  32. Bokrantz, J.; Skoogh, A.; Berlin, C.; Stahre, J. Maintenance in digitalised manufacturing: Delphi-based scenarios for 2030. Int. J. Prod. Econ. 2017, 191, 154–169.
  33. Khoshafian, S.; Rostetter, C. Digital prescriptive maintenance. In BPM Everywhere: Internet of Things, Process of Everything; Future Strategies Inc.: Pompano Beach, FL, USA, 2015; pp. 1–20.
  34. Ansari, F.; Glawar, R.; Nemeth, T. PriMa: A prescriptive maintenance model for cyber-physical production systems. Int. J. Comput. Integr. Manuf. 2019, 32, 482–503.
  35. Russell, S.J.; Norvig, P. Artificial Intelligence: A Modern Approach, Global Edition, 4th ed.; Pearson: Harlow, UK, 2021.
  36. Mitchell, T. Machine Learning; McGraw-Hill: New York, NY, USA, 1997.
  37. Dasgupta, A.; Nath, A. Classification of machine learning algorithms. Int. J. Innov. Res. Adv. Eng. 2016, 3, 6–11.
  38. Nguyen, V.-T.; Do, P.; Vosin, A.; Iung, B. Artificial-intelligence-based maintenance decision-making and optimization for multi-state component systems. Reliab. Eng. Syst. Saf. 2022, 228, 108757.
  39. Ogunfowora, O.; Najjaran, H. Reinforcement and deep reinforcement learning-based solutions for machine maintenance planning, scheduling policies, and optimization. J. Manuf. Syst. 2023, 70, 244–263.
  40. Rocchetta, R.; Bellani, L.; Compare, M.; Zio, E.; Patelli, E. A reinforcement learning framework for optimal operation and maintenance of power grids. Appl. Energy 2019, 241, 291–301.
  41. Kiewwath, N. Developing a Reinforcement Learning-Based Model for Prescriptive Maintenance in Air Conditioning Systems in Thailand. In Proceedings of the 7th International Conference on Culture Technology 2024, Daegu, Republic of Korea, 23–26 October 2024; pp. 189–194.
  42. Kiewwath, N. Machine Learning Approaches for Effective Energy Forecasting and Management: A Case Study of a Building at Chiang Mai University. In Proceedings of the 6th International Conference on Culture Technology 2023, Sunway City, Malaysia, 1–4 December 2023; pp. 81–87.
  43. Talami, R.; Dawoodjee, I.; Ghahramani, A. Demystifying energy savings from dynamic temperature setpoints under weather and occupancy variability. Energy Built Environ. 2024, 5, 878–888.
  44. Shi, Z.; Zheng, R.; Zhao, J.; Shen, R.; Gu, L.; Liu, Y.; Wu, J.; Wang, G. Towards various occupants with different thermal comfort requirements: A deep reinforcement learning approach combined with a dynamic PMV model for HVAC control in buildings. Energy Convers. Manag. 2024, 320, 118995.
  45. Rizvi, S.A.A.; Pertzborn, A.J.; Lin, Z. Development of a bias compensating Q-learning controller for a multi-zone HVAC facility. IEEE/CAA J. Autom. Sin. 2023, 10, 1704–1715.
  46. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1928–1937.
  47. Gal, Y.; Ghahramani, Z. Dropout as a Bayesian approximation. arXiv 2015, arXiv:1506.02157.
  48. Kendall, A.; Gal, Y. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2017; Volume 30.
Figure 1. Two-stage evaluation pipeline for the proposed Enhanced DQN. The first stage uses LunarLander-v3 as a diagnostic benchmark with five random seeds and 2000 episodes to assess learning curves, mean ± standard deviation, and convergence indicators. The second stage applies the framework to a 365-day split-type air-conditioning simulation using hourly weather and PM10 input profiles in SplitTypeEnv, with outputs including annual energy use, comfort, filter health, and savings relative to the baseline.
Figure 2. Structure of the SplitTypeEnv simulation environment and information flow at each hourly decision step. Exogenous inputs, including outdoor temperature, outdoor humidity, PM10 concentration, and time features, are processed by the simulation environment to update indoor dynamics, energy proxy, and filter health. The RL agent receives the observed state, selects a control action, and interacts with the environment in a closed loop. Reward computation is based on energy, comfort, and maintenance penalties for multi-objective learning.
Figure 3. Learning performance comparison on LunarLander-v3 across five random seeds. Solid lines represent the mean episodic return, and shaded regions indicate ±1 standard deviation. The dashed horizontal line at 200 denotes the solved threshold for reference.
Figure 4. Reward distribution on LunarLander-v3 across five random seeds. Boxplots summarize the episodic return distribution over the final training episodes for each algorithm, showing the median, interquartile range, whiskers, and outliers.
Figure 5. Annual energy consumption comparison between the fixed 25 °C baseline and the two strongest reward-weight configurations of the proposed Enhanced DQN, RW3 and RW4, reported as mean ± standard deviation across five random seeds (n = 5).
Figure 6. Daily energy consumption and PM10 concentration under the RW4 configuration over a 365-day simulation period, illustrating environmentally adaptive control behavior under time-varying air-quality conditions.
Figure 7. Monthly energy consumption under different reward-weight configurations (RW1–RW4) compared with the baseline strategy, demonstrating preserved seasonal patterns and consistent energy reduction across the annual cycle.
Table 1. Hyperparameter settings for baselines and the proposed Enhanced DQN in the LunarLander-v3 benchmark.
| Category | DQN/DDQN/Dueling DQN (Baselines) | Enhanced DQN (Proposed) |
|---|---|---|
| Environment | LunarLander-v3 (default configuration) | LunarLander-v3 (fallback v2) |
| Episodes | 2000 (default configuration) | 2000 per seed |
| Random seeds | configurable via CLI (default 0) | fixed list (0, 1, 2, 3, 4) |
| Discount factor (γ) | 0.99 | 0.99 |
| Software stack | Python 3.13; PyTorch 2.8; Gymnasium 1.2; NumPy/pandas 2.3 | Python 3.13; PyTorch 2.8; Gymnasium 1.2; NumPy/pandas 2.3 |
| Optimizer/LR | lr = 1 × 10−3 | lr = 3 × 10−4 |
| Batch size | 64 | 128 |
| Replay buffer | Uniform replay (100,000) | PER (100,000) |
| PER parameters | — | α = 0.6, β = 0.4 |
| Exploration | ε-greedy (ε_decay = 0.995) | Softmax sampling (T = 0.5) |
| Network capacity | 128 × 128 | 256 × 256 + MC Dropout (p = 0.1) |
| Multi-step returns | — | n = 3 |
| Target update | soft update (τ = 1 × 10−3) | hard update every 50 episodes |
| Output logging | CSV: episode, reward, loss | CSV: per seed |
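The temperature-based exploration listed in Table 1 (softmax sampling with T = 0.5) can be sketched as Boltzmann action selection over the network's Q-values. This is an illustrative sketch only, assuming a standard softmax over Q-values with a max-shift for numerical stability; the paper's actual sampling code may differ in detail.

```python
import numpy as np

def softmax_action(q_values, temperature=0.5, rng=None):
    """Sample an action from a Boltzmann (softmax) distribution over Q-values.

    Illustrative sketch of the T = 0.5 exploration in Table 1; `rng` and the
    max-shift are implementation choices, not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    q = np.asarray(q_values, dtype=np.float64)
    logits = (q - q.max()) / temperature   # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()                   # normalize into a probability vector
    return int(rng.choice(len(q), p=probs)), probs
```

As T → 0 this approaches greedy selection; larger T flattens the distribution, so T = 0.5 keeps exploration biased toward high-value actions without collapsing to ε-greedy's uniform random moves.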
Table 2. Reward weight configurations (RW1–RW4) and design rationale for multi-objective trade-off analysis.
| Config ID | Energy Weight | Maintenance Weight | Comfort Weight | Design Intent | Expected Policy Behavior |
|---|---|---|---|---|---|
| RW1 (Balanced) | 1.0 | 1.0 | 1.0 | Balanced multi-objective control | Conservative; prioritizes comfort and system health |
| RW2 (Energy-Dominant) | 4.0 | 2.0 | 1.0 | Energy efficiency as primary objective with soft constraints | Reduced energy with acceptable comfort degradation |
| RW3 (Aggressive Energy) | 6.0 | 2.0 | 0.8 | Strong energy minimization under relaxed comfort constraints | Higher energy saving with increased comfort trade-off |
| RW4 (High Energy Weight) | 15.0 | 5.0 | 1.0 | Stress-testing reward sensitivity and policy extremes | Strong emphasis on energy reduction; possible reward saturation effects |
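The reward weights in Table 2 can be read as coefficients on a negative weighted sum of per-step penalties. The sketch below is a minimal illustration of that trade-off, assuming the penalty terms (`energy_kwh`, `comfort_dev`, `filter_wear`) are scalar placeholders; the exact energy proxy, comfort deviation, and filter-health terms in SplitTypeEnv are not reproduced here.

```python
def step_reward(energy_kwh, comfort_dev, filter_wear,
                w_energy=1.0, w_maint=1.0, w_comfort=1.0):
    """Hypothetical weighted multi-objective step reward (larger weight =
    stronger penalty on that objective). Term definitions are placeholders."""
    return -(w_energy * energy_kwh
             + w_comfort * comfort_dev
             + w_maint * filter_wear)

# Same hypothetical step under RW1 (balanced) vs. RW4 (high energy weight):
rw1 = step_reward(1.2, 0.5, 0.2)                                        # weights 1/1/1
rw4 = step_reward(1.2, 0.5, 0.2, w_energy=15.0, w_maint=5.0)            # weights 15/5/1
```

Under RW4 the same kilowatt-hour costs the agent 15× as much reward as under RW1, which is why the learned policy shifts toward aggressive energy reduction at the expense of comfort slack.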
Table 3. Experimental design matrix for diagnostic benchmark and applied case study evaluation.
| Stage | Scenario | Seeds | Horizon | Outputs Reported in Results | Results Section |
|---|---|---|---|---|---|
| Stage 1 | LunarLander-v3 benchmark | 5 | 2000 episodes | Learning curves (mean ± SD); final-100 distribution | Section 6.1 |
| Stage 2 | AC simulation (RW3 and RW4 highlighted) | 5 | 8760 steps (365 days) | Annual kWh, savings %, narrative analysis | Section 6.2 |
| Stage 2 | PM10 adaptation (RW4) | 5 | 365 days | Daily energy vs. PM10 time series | Section 6.3 |
| Stage 2 | Reward sensitivity (RW1–RW4) | 5 | 365 days | Monthly energy vs. baseline; annual summary across RW | Section 6.4 |
Table 4. Annual energy consumption comparison between baseline and Enhanced DQN (mean ± standard deviation, n = 5 seeds).
| Method | Annual Energy (kWh) | Energy Reduction (kWh) | Energy Saving (%) |
|---|---|---|---|
| Baseline | 5116.22 | — | — |
| RW1 | 4521.20 ± 50.10 | 595.02 ± 50.10 | 11.63 ± 0.98 |
| RW2 | 4456.61 ± 20.16 | 651.08 ± 20.16 | 12.90 ± 0.45 |
| RW3 | 4440.69 ± 27.90 | 675.54 ± 27.90 | 13.20 ± 0.55 |
| RW4 | 4440.03 ± 37.50 | 676.20 ± 37.50 | 13.22 ± 0.72 |
Note: Energy reduction and saving percentage are calculated relative to the fixed 25 °C baseline (5116.22 kWh).
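The reduction and saving columns follow directly from the fixed baseline; a minimal sketch of that arithmetic, using the RW4 mean from Table 4 (small last-digit differences against the table are expected, since the reported values were presumably computed from unrounded per-seed means):

```python
def energy_saving(baseline_kwh, controlled_kwh):
    """Return (reduction in kWh, saving in %) relative to a fixed baseline."""
    reduction = baseline_kwh - controlled_kwh
    return reduction, 100.0 * reduction / baseline_kwh

# Baseline vs. RW4 mean annual energy: roughly 676 kWh, about 13.2% saved.
reduction, pct = energy_saving(5116.22, 4440.03)
```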