1. Introduction
Over the past several decades, global electricity demand has increased continuously. This growing demand for electricity arises from several factors, including economic expansion, industrial growth, the increase in population, and urbanization [
1]. In the building sector, electricity demand has also been increasing steadily, contributing a substantial share of final energy consumption and energy-related CO
2 emissions [
2]. A report by the International Energy Agency (IEA) states that electricity use for space cooling by air conditioning systems in buildings accounts for about 20% of total building electricity consumption, and this share is expected to keep increasing as global average temperatures rise and average incomes grow, enabling more households to purchase and use air conditioners [
3].
For Thailand, which is located in a tropical region, space cooling in residential buildings is essential. In the household sector, split-type air conditioners are very popular because they are relatively inexpensive and easy to install. A 2018 survey of residential electricity consumption in Thailand reported that total residential electricity consumption was 35,624 GWh, of which air conditioners accounted for 26.50%, followed by refrigerators at 19.33% [
4]. In addition, national energy statistics report that residential electricity consumption has increased continuously from 53,747 GWh in 2022 to 57,726 GWh in 2023, and in 2024 residential electricity consumption increased by 7.7% compared to the previous year. These reports conclude that electricity demand in the residential sector is likely to keep increasing due to urban expansion and hotter climatic conditions [
5,
6]. Efficient operation and proper maintenance of air conditioners can reduce electricity consumption and greenhouse gas emissions, which aligns with the Sustainable Development Goals (SDGs), in particular SDG 7 on affordable and clean energy and SDG 13 on climate action [
7,
8].
In practice, the performance and efficiency of air conditioners do not depend only on the thermostat setpoint and the operating schedule. External factors also matter, including global warming and its upward temperature trend, climate variability, and air pollution from particulate matter such as PM10. These factors affect air conditioner performance in several ways: higher temperatures and climate variability increase cooling demand, while particulate pollution promotes dust accumulation on air filters and heat-exchanger surfaces, which can reduce operating efficiency and increase cleaning and maintenance requirements over time [
9,
10].
In conventional practice, air conditioners are usually operated with a fixed temperature setpoint and scheduled operating hours based on government energy saving campaigns, which primarily focus on user thermal comfort. For maintenance, preventive maintenance is typically performed according to a predefined schedule, and if the air conditioner fails, a technician is called to repair it, which is corrective maintenance [
11]. These traditional approaches cannot adequately cope with dynamically changing weather conditions, temporal variations in fine particulate pollution, or seasonal differences in climate. They also do not explicitly adapt to seasonal conditions while simultaneously considering user comfort, energy savings, and long-term maintenance of air conditioner components. Thermal comfort in residential environments is commonly discussed with reference to ASHRAE-55-based studies and tropical residential field evidence, which often indicate acceptable or neutral comfort temperatures in the mid-20 °C range [
12,
13]. These limitations motivate the need for energy-efficient and maintenance-aware control policies that can adapt to continuously changing environmental conditions, respond to seasonal dust levels, and support maintenance activities so that the air conditioner can operate for a longer lifetime.
Reinforcement Learning (RL) is a principled framework for sequential decision-making under uncertainty [
14]. In an RL formulation, an agent interacts with a stochastic environment: at each state, the agent selects an action and receives a reward, and it thereby learns to make decisions in different situations that yield the best long-term outcome. RL problems are typically formulated as Markov decision processes (MDPs), and the main objective is to learn a policy that maximizes the expected discounted return over the long run. Q-learning, or classical tabular Q-learning, can be regarded as a stochastic approximation scheme for solving the Bellman optimality equation, and it is well known that it converges to the optimal action-value function under standard conditions on the learning rate, exploration, and the stationarity of the Markov decision process [
15]. However, when Q-learning is combined with nonlinear function approximation, bootstrapping, and off-policy learning, these three elements together create what is known as the “deadly triad”, which can lead to instability and divergence [
14].
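As a concrete illustration of the tabular update just described, the following sketch applies one Q-learning step on a hypothetical two-state, two-action problem; the function and array names are ours, not from this paper, and the learning-rate and discount values are arbitrary examples.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One stochastic-approximation step toward the Bellman optimality equation:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    td_target = r + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q, td_error

# Toy example: 2 states, 2 actions, reward 1.0 for taking action 1 in state 0.
Q = np.zeros((2, 2))
Q, err = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
# With all-zero initial values, the TD error is 1.0 and Q[0, 1] moves to 0.1.
```

Under the standard conditions cited above (decaying learning rates, persistent exploration, a stationary MDP), repeated application of this update converges to the optimal action-value function.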
The Deep Q-Network (DQN) architecture mitigates some of these issues by using a deep neural network to approximate the action value function, together with experience replay and a target network to stabilize training [
16]. DQN has achieved human level performance on Atari game benchmarks, but it still suffers from well-known limitations such as overestimation bias in Q-values, sensitivity to hyperparameters, and brittle exploration based on ε-greedy policies. To address these weaknesses, several extensions have been proposed. Double Q-learning decouples action selection from target evaluation to reduce overestimation bias [
17]; dueling network architectures separate state-value and advantage estimation, improving representation learning in states where many actions have similar returns [
18]; prioritized experience replay (PER) samples transitions with large temporal-difference (TD) errors more frequently, accelerating the propagation of informative updates [
19]; and multi-step returns incorporate rewards from several future steps, improving credit assignment in long horizon tasks [
14]. In addition, Bayesian regularization can be applied to neural network parameters to penalize overly complex models and reduce overfitting, which is particularly important when training deep value functions on non-stationary or noisy data.
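As a sketch of how two of these extensions interact, the following hypothetical function combines an n-step return with the Double DQN target: the online network selects the bootstrap action and the target network evaluates it. The function name and numerical values are illustrative assumptions, not taken from this paper.

```python
import numpy as np

def double_dqn_nstep_target(rewards, q_online_next, q_target_next, gamma=0.99):
    """n-step Double DQN target:
    y = sum_{k=0}^{n-1} gamma^k * r_k
        + gamma^n * Q_target(s_{t+n}, argmax_a Q_online(s_{t+n}, a)).
    Decoupling action selection (online net) from action evaluation
    (target net) reduces overestimation bias."""
    n = len(rewards)
    n_step_return = sum(gamma ** k * r for k, r in enumerate(rewards))
    a_star = int(np.argmax(q_online_next))   # action selection: online network
    bootstrap = q_target_next[a_star]        # action evaluation: target network
    return n_step_return + gamma ** n * bootstrap

# Example with a 3-step return followed by a bootstrapped tail value.
y = double_dqn_nstep_target(
    rewards=[1.0, 0.5, 0.25],
    q_online_next=np.array([0.2, 0.8]),   # online net prefers action 1
    q_target_next=np.array([0.3, 0.6]),   # target net evaluates action 1 as 0.6
)
```

In a prioritized replay buffer, the absolute TD error against a target of this form would then determine each transition's sampling priority.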
Another line of research introduces entropy-based regularization into the RL objective. In maximum-entropy RL and soft Q-learning, the agent seeks to maximize not only the expected return but also the expected policy entropy, leading to “soft” Bellman operators and stochastic policies that are both high-performing and exploratory [
20]. This perspective underlies Soft Actor-Critic and related algorithms, which have demonstrated improved robustness and sample efficiency in continuous control [
21]. When entropy regularization is incorporated into value-based methods, it can be interpreted as a mechanism for stabilizing exploration and avoiding premature convergence to deterministic policies, especially in multi objective environments where several actions may have comparable returns. In the context of air conditioning control, entropy-based exploration control is appealing because it can encourage diverse operating patterns during training, while Bayesian regularization and multi-step returns support more stable learning of long-horizon energy and maintenance trade-offs.
Deep RL has been applied increasingly to building and HVAC control, with studies showing that well-designed agents can reduce energy consumption while maintaining thermal comfort in both simulated and real-world environments [
22,
23,
24]. Existing work includes applications to chiller plants, variable air volume systems, and home energy management, often reporting significant energy savings relative to rule-based baselines. However, relatively few studies focus specifically on residential split-type air conditioners in hot and humid climates, and most prior work primarily targets energy efficiency and comfort without explicitly modeling the interaction between control behavior, dust related performance degradation, and maintenance-awareness. Moreover, many implementations rely on standard DQN variants without systematically combining multiple stability-enhancing mechanisms or explicitly integrating Bayesian and entropy regularization into the learning process.
To address these gaps, this paper proposes an Enhanced Deep Q-Network (Enhanced DQN) for energy-efficient and maintenance-aware control of residential split-type air conditioners. The proposed algorithm integrates Double Q-learning, dueling network architecture, prioritized experience replay, multi-step returns, Bayesian regularization, and entropy-based exploration control into a unified value-based RL framework [
16,
17,
18,
19,
20]. From a design perspective, these components are chosen to improve learning stability, support multi-objective operation, and make the agent suitable for long horizon tasks in dynamically changing environments. The Enhanced DQN is first evaluated through a diagnostic benchmark on the LunarLander-v3 environment using multiple random seeds to analyze convergence behavior and policy variance. It is then subjected to an applied evaluation by simulating the control of a 15,000 BTU residential split-type air conditioner, operating hourly for 365 days under environmental conditions that include both Thailand’s seasonal weather and varying PM10 dust levels.
The experimental results show that the proposed control approach can reduce annual electricity consumption by up to 13.22% compared with a constant temperature setting of 25 °C, demonstrating tangible energy savings relative to a widely recommended reference setpoint. In addition, the learned policy exhibits maintenance-aware control behavior, adjusting operating patterns in response to dust conditions and implicitly supporting better maintenance planning. These findings highlight the potential of reinforcement learning to promote intelligent energy management and sustainable maintenance for residential cooling systems, in line with SDG 7 and SDG 13 [
7,
8].
The main contributions of this paper are threefold:
We formulate residential split-type air conditioner operations in a tropical country as an energy-efficient and maintenance-aware RL problem that explicitly accounts for seasonal weather variation, PM10 dust levels, and long-horizon interactions between control actions, energy use, and maintenance-related behavior.
We develop an Enhanced DQN that combines Double Q-learning, dueling network architecture, prioritized experience replay, multi-step returns, Bayesian regularization, and entropy-based exploration control into a unified stability-oriented framework suitable for multi-objective, long-horizon control.
We conduct a two-stage evaluation involving a diagnostic benchmark on LunarLander-v3 using multiple random seeds to study learning stability and policy variance, followed by a 365-day simulation of a 15,000 BTU residential split-type air conditioner under realistic Thai weather and PM10 conditions, demonstrating annual energy savings of approximately 13.2% relative to a fixed 25 °C baseline.
The remainder of this paper is organized as follows.
Section 2 reviews related work on deep RL for building and HVAC control and on stability-enhancing extensions of DQN.
Section 3 presents the problem formulation, system model, and RL setup for residential split-type air conditioner control.
Section 4 describes the proposed Enhanced DQN architecture and its Bayesian and entropy regularization mechanisms.
Section 5 details the experimental design and reports the results for both the LunarLander-v3 and air-conditioning simulations.
Section 6 discusses the implications, limitations, and potential extensions of this work. Finally,
Section 7 concludes the paper and outlines directions for future research.
2. Related Work
2.1. Global Cooling Demand, SDGs, Split-Type Air Conditioners, and Smart Buildings
As discussed in
Section 1, electricity use in the building sector has been increasing steadily and already accounts for a substantial share of final energy consumption and energy related CO
2 emissions worldwide [
1,
2]. According to the International Energy Agency (IEA), space cooling is one of the fastest-growing end uses of electricity: air conditioners and electric fans together consume roughly one fifth of all electricity used in buildings, and cooling demand could increase two- to three-fold by 2050 in the absence of additional efficiency measures [
1,
3]. This trend poses a direct challenge to the Sustainable Development Goals (SDGs), particularly SDG 7 on affordable and clean energy and SDG 13 on climate action, because increasing peak electricity demand from air conditioning requires additional investment in power generation and network infrastructure and contributes to higher greenhouse gas emissions [
7,
8].
In tropical countries such as Thailand, residential space cooling is essential, and split-type air conditioners are widely adopted because they are relatively inexpensive and easy to install. National surveys indicate that air conditioners account for the largest share of household electricity use, and residential demand has continued to rise with urbanization, higher incomes, and warmer climatic conditions [
4,
5,
6]. In addition, external stressors such as climate variability and particulate pollution can degrade air conditioner performance. Higher ambient temperatures increase cooling loads, while particulate matter accelerates filter and coil fouling, which reduces airflow and heat transfer and forces fans and compressors to work harder [
9,
10].
In practice, residential air conditioners are still commonly operated using simple strategies, such as fixed temperature setpoints recommended by government energy-saving campaigns and coarse on/off schedules. Maintenance is typically performed using time-based preventive routines, for example, periodic cleaning every few months, while corrective maintenance is carried out when failures occur [
11]. Such traditional strategies are not designed to cope with dynamically changing weather, seasonal PM10 patterns, or long-horizon interactions between control actions, energy use, and equipment degradation. These limitations motivate energy-efficient and maintenance-aware control policies that adapt to environmental dynamics and support more proactive maintenance.
In parallel, smart buildings have emerged as an important response to these challenges, integrating intelligent energy management, automated building systems, and IoT-based sensing to support adaptive control of HVAC, lighting, and other building services [
25]. In this context, HVAC systems are often regarded as the "energy heart" of the building, as they represent a dominant controllable load that directly affects comfort and electricity costs [
26]. Smart building and smart home platforms rely on IoT devices and sensors to measure variables such as temperature, humidity, indoor air quality, occupancy, and equipment states, thereby providing the data needed for real-time control and optimization [
27].
Building Management Systems (BMS) supervise and control subsystems in commercial and institutional buildings by combining sensor data with algorithms for setpoint adjustment, fault detection, and alarm generation [
28]. At the residential scale, energy-efficient smart home systems emphasize decision-making processes that reduce energy use and costs while maintaining comfort and responding to occupant behavior [
29]. In this study, these developments matter for two reasons. First, they provide sensing and communication infrastructure to observe the operation of individual split-type air conditioners in real time. Second, they enable tighter integration of reinforcement learning with predictive and prescriptive maintenance by treating operational signals as both control inputs and health indicators for maintenance analytics.
2.2. Predictive and Prescriptive Maintenance for HVAC and Cooling Systems
Maintenance is a critical function in asset life-cycle management because it affects availability, reliability, and total life-cycle cost. Manzini et al. define maintenance as a set of technical, administrative, and managerial actions performed throughout an asset’s life to preserve it in, or restore it to, a state in which it can perform its required function [
30]. Maintenance strategies range from reactive run-to-failure approaches to time-based preventive maintenance, and further to condition-based maintenance, predictive maintenance, and prescriptive maintenance.
In the HVAC domain, the literature on predictive maintenance (PdM) has grown rapidly. Systematic reviews such as that of Zonta et al. highlight the multidisciplinary nature of PdM in the Industry 4.0 context, spanning data acquisition, feature extraction, modelling, decision support, and business integration [
31]. Essakali et al. surveyed PdM algorithms applied to HVAC systems, covering air handling units (AHUs), chiller plants, cooling towers, and duct systems, and discussed approaches ranging from signal analysis and health-index construction to machine learning and deep-learning methods for fault detection and remaining useful life (RUL) prediction. Other studies develop data-driven PdM and fault-detection frameworks for whole-building HVAC systems using real sensor data, with the aim of improving energy efficiency while scheduling maintenance more effectively [
26,
30]. Zonta et al. proposed a taxonomy that classifies PdM approaches by model type, including regression models for RUL estimation, classification models for health-state or fault-type prediction, and survival models for failure probability over time, as well as by modelling paradigm, namely physics-based, knowledge-based, data-driven, and hybrid approaches [
31]. The emerging consensus is that future PdM systems will be increasingly hybrid, combining data-driven models with domain knowledge to achieve robust performance under real-world constraints.
Beyond PdM, prescriptive maintenance (PsM) has been proposed as the most advanced stage of knowledge-based maintenance. Instead of stopping at the question of when an asset will fail, PsM explicitly addresses what should be done, when, and how, in order to optimize cost, risk, and system performance. Bokrantz et al. analyzed the role of maintenance in digitalized manufacturing and developed Delphi-based scenarios for maintenance in 2030, linking PsM to Industry 4.0 and to horizontally and vertically integrated data flows in factories [
32]. Khoshafian and Rostetter introduced the notion of digital prescriptive maintenance, emphasizing the integration of IoT, the Process of Everything, and business process management (BPM) to connect sensor data with automated maintenance decision-making [
33]. Ansari et al. proposed the PriMa framework as a prescriptive maintenance model for cyber-physical production systems, explicitly linking sensor data, models, and knowledge bases to prescriptive maintenance planning [
34].
Artificial intelligence (AI) provides an overarching framework for methods that enable machines to exhibit human-like capabilities such as reasoning, learning, and decision-making [
35]. Within this framework, machine learning (ML) is often regarded as the core mechanism by which systems improve performance with experience; Mitchell defined it as the study of algorithms that enable computer programs to improve their performance through experience [
36]. Dasgupta and Nath classify ML algorithms into supervised, unsupervised and reinforcement learning, based on the availability of labels and the nature of the learning objective [
37].
In maintenance applications, ML has been used for continuous variable prediction, classification of normal and faulty states, anomaly detection using unsupervised methods and system level cost or risk estimation using deep models. Nguyen et al. developed an AI-based maintenance decision-making and optimization framework for multi-state component systems, showing that artificial neural networks (ANNs) can accurately estimate maintenance costs at the system level and that multi-agent deep RL can further improve decision efficiency [
38].
Reinforcement learning has attracted growing interest in maintenance planning and operation. Garcia and Rachelson provide a comprehensive overview of Markov decision processes (MDPs) as the mathematical foundation for sequential decision-making under uncertainty, while Sutton and Barto present RL methods ranging from tabular algorithms to deep RL [
14]. In a recent review, Ogunfowora and Najjaran survey reinforcement learning- and deep reinforcement learning-based solutions for machine maintenance planning, scheduling policies, and optimization, and report that roughly 70% of the surveyed work uses Q-learning or DQN variants, reflecting the model-free nature of many maintenance problems [
39]. Rocchetta et al. proposed a reinforcement learning framework for optimal operation and maintenance of power grids, using prognostics and health management (PHM) information to support joint operation and maintenance decisions and to maximize expected profit under environmental uncertainty [
40].
Despite these advances, most PdM/PsM and RL-based maintenance frameworks focus on industrial machinery or large HVAC plants in commercial buildings. The control layer (which governs equipment operation and energy use) and the maintenance-planning layer are usually treated separately: controllers aim to track loads or minimize energy, while PdM/PsM models use health indicators and failure probabilities to plan interventions. There is comparatively little work on maintenance-aware controllers whose objective functions natively combine energy, comfort, and maintenance stability, for example by limiting compressor start–stop frequency or adapting control patterns to dust-related performance degradation.
In the Thai context, the authors previously proposed a reinforcement learning-based prescriptive maintenance model for air-conditioning systems which formulated an RL framework to link operational data from air conditioners to maintenance decisions at the policy level [
41]. Another study applied machine learning approaches to energy forecasting and management for a university building in Chiang Mai, demonstrating that data-driven models can support building-level energy management decisions effectively [
42]. However, these earlier works focused on forecasting and policy-level maintenance planning. The present study builds on this line of research by designing an Enhanced DQN that is maintenance-aware by construction and by embedding maintenance-related stability metrics directly into the reward function of an RL controller for residential split-type air conditioners.
2.3. RL Theory, the Deadly Triad, and the Deep Q-Network Family
From a theoretical perspective, reinforcement learning problems are typically formulated as Markov decision processes (MDPs), where an agent interacts with a stochastic environment by observing states, selecting actions, and receiving rewards, with the aim of maximizing long-term discounted return. Classical tabular Q-learning can be viewed as a stochastic approximation scheme for solving the Bellman optimality equation, and it is well known to converge to the optimal action-value function under standard conditions on the learning rate, exploration, and stationarity of the MDP [
14,
15].
When Q-learning is combined with nonlinear function approximation, bootstrapping, and off-policy learning, the three elements together form the so-called deadly triad, which can cause instability or even divergence in value estimates [
14]. Mnih et al. proposed the Deep Q-Network (DQN) architecture to mitigate these issues by using a deep neural network to approximate the Q-function together with an experience replay buffer and a target network to reduce correlation in training data and stabilize parameter updates [
16]. Subsequent work introduced several extensions. Double DQN addresses overestimation bias by decoupling action selection and action evaluation in the target update [
17]. Dueling network architectures separate state-value and advantage streams, improving value estimation in states where many actions have similar returns [
18]. Prioritized experience replay (PER) samples transitions in proportion to their temporal-difference (TD) error, emphasizing informative samples and improving sample efficiency [
19]. These components form a family of increasingly stable DQN variants that have been applied in various control domains, including energy and HVAC applications [
22,
23,
24,
39,
40].
In the present work, these ideas are combined in an Enhanced DQN that integrates Double Q-learning, dueling network architecture, PER, and multi-step returns into a single value-based RL agent tailored to long-horizon, multi-objective control of split-type air conditioners.
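The dueling aggregation referenced above can be illustrated without a deep-learning framework. This minimal NumPy sketch shows only the identifiable mean-subtracted form of the decomposition, not the full network architecture used in this work.

```python
import numpy as np

def dueling_q(value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)).
    Subtracting the mean advantage makes the V/A decomposition identifiable,
    so the state-value and advantage streams learn distinct roles."""
    advantages = np.asarray(advantages, dtype=float)
    return value + (advantages - advantages.mean())

# Hypothetical state: V(s) = 2.0 and three advantage estimates.
q = dueling_q(value=2.0, advantages=[0.5, -0.5, 0.0])
# The mean advantage here is 0, so q == [2.5, 1.5, 2.0].
```

Note that adding a constant to all advantages leaves the resulting Q-values unchanged, which is precisely why the mean is subtracted: without it, V and A would not be separately identifiable.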
2.4. Maximum-Entropy RL and Entropy Regularization in Q-Learning
Maximum-entropy reinforcement learning extends the traditional RL objective by augmenting the expected return with an entropy term that encourages stochastic policies with sufficient exploration [
20]. This leads to soft Bellman operators and soft Q-functions, where the value of a state–action pair reflects both its expected return and the entropy of the resulting policy. Algorithms such as Soft Q-learning and Soft Actor–Critic (SAC) have demonstrated that entropy regularization can substantially improve learning stability and sample efficiency in a wide range of continuous-control tasks [
20,
21]. SAC has become a widely adopted baseline in robotics and energy-related control problems.
Recent analytical work has linked non-uniform sampling schemes and entropy regularization to stochastic approximation theory, clarifying conditions under which the choice of sampling distribution and entropy weight preserves or improves convergence properties of Q-learning like algorithms. Although many practical applications of maximum entropy RL focus on actor–critic methods with continuous action spaces, the underlying ideas can also be adapted to value-based methods by incorporating entropy terms into the value function and the policy.
In the context of air-conditioning control, entropy-based exploration control is appealing because it can encourage the agent to explore diverse operating patterns during training, rather than prematurely collapsing to a brittle deterministic policy. Combined with Bayesian regularization of network parameters, which penalizes overly complex models and mitigates overfitting in non-stationary or noisy environments, and with multi-step returns, which improve long-horizon credit assignment, these mechanisms provide a stability-oriented design that is well suited to controlling split-type air conditioners over a full year of operation under varying weather and PM10 conditions.
2.5. Deep RL for Building and HVAC Control and Remaining Gaps
Deep RL has been increasingly applied to building energy management and HVAC control. Recent reviews by Yu et al. [
22], Fu et al. [
23] and Wang et al. [
24] summarize a broad range of RL and deep RL applications, including temperature control, chiller plant optimization, variable air volume (VAV) systems, demand response and home energy management systems (HEMS), in both simulation and real-world buildings. These studies consistently report that RL-based controllers can reduce energy consumption relative to rule-based or, in some cases, model predictive control (MPC) baselines, while maintaining or improving occupant comfort. Several previous studies have represented thermal comfort using simplified temperature-based criteria, typically expressed in terms of fixed or dynamic temperature setpoints, setpoint-tracking performance, or broad acceptable indoor temperature ranges rather than occupant-specific comfort models [
43,
44,
45]. Such temperature ranges are often selected with reference to established thermal comfort guidelines, including ASHRAE Standard 55 [
12,
13].
At the level of concrete case studies, Wang et al. applied DQN-based controllers to smart-building HVAC systems and showed that learning from operational data can reduce energy costs while maintaining comfort [
24]. Dong et al. reviewed smart building sensing systems for indoor environment control and highlighted the importance of sensor-based monitoring infrastructure in enabling adaptive HVAC operation and energy-efficient building management [
27]. Geraldo Filho et al. reviewed energy-efficient smart home systems and highlighted that both infrastructure design and decision-making processes play essential roles in enabling adaptive and energy-efficient operation in residential environments [
29].
Nevertheless, most of the existing literature focuses on large systems—such as chiller plants, VAV systems, or whole-building HVAC—or on demand response and load management, rather than on individual residential split-type air conditioners in hot and humid climates. Moreover, relatively few studies formulate the control problem as a multi-objective task that explicitly emphasizes maintenance stability, such as limiting compressor cycling or incorporating simple proxies of equipment degradation linked to dust accumulation and air quality.
In the Thai context, the authors have previously investigated machine learning-based energy forecasting and management for a university building in Chiang Mai, showing that data-driven models can support building-level energy management decisions [
42]. However, that work focused primarily on forecasting and analytical energy management, rather than direct RL-based control of individual devices.
The present study addresses these gaps along two dimensions. First, at the application level, it focuses on a residential split-type air conditioner in a tropical climate, modelling seasonal weather and PM10 dust levels explicitly within the RL environment. Second, at the algorithmic level, it proposes an Enhanced DQN that combines Double Q-learning, dueling networks, PER, multi-step returns, Bayesian regularization, and entropy-based exploration control into a unified stability-oriented framework. The agent is trained and evaluated over a 365-day control horizon, with a reward function that balances energy consumption, thermal comfort, and maintenance-related stability metrics, thereby positioning the work at the intersection of deep RL, smart building control, and prescriptive maintenance.
In terms of reported performance, several prior studies on RL-based HVAC control have demonstrated energy savings typically ranging from approximately 5% to 20%, depending on building type, control objectives, and evaluation conditions. While direct numerical comparison is difficult due to differences in system scale, environment modeling, and experimental protocols, these results provide a contextual reference for interpreting the performance of the proposed method.
3. Problem Formulation, System Model, and Reinforcement Learning Setup
This study considers long-horizon control of a residential split-type air conditioner operating under time-varying outdoor weather and particulate matter (PM10) exposure. The controller is designed to jointly reduce electricity consumption, maintain acceptable thermal comfort, and mitigate maintenance-related degradation over an annual operating cycle. To support decision-making under stochastic transitions and delayed operational consequences, we formulate the task as a reinforcement learning (RL) problem using a Markov decision process (MDP).
3.1. Markov Decision Process Formulation
The control task is modeled as an MDP defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ is a finite discrete action space, $P$ represents the transition dynamics, $R$ denotes the reward function, and $\gamma \in [0, 1)$ is the discount factor. At each decision step $t$, the agent observes $s_t \in \mathcal{S}$, selects an action $a_t \in \mathcal{A}$, receives a scalar reward $r_t$, and transitions to the next state $s_{t+1}$.
The objective is to maximize the expected discounted return
$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1} \gamma^{t} r_{t}\right],$$
where $T$ denotes the episode length and $\pi$ is the control policy. In this study, each episode corresponds to a full-year simulation with hourly resolution ($T$ = 8760 time steps). This formulation enables the agent to learn long-horizon control policies under seasonal variability, time-varying environmental conditions, and delayed system responses. The exact horizon definition and implementation details are provided in Section 5 [14].
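As a minimal illustration of the optimization objective (a sketch, not the authors' implementation), the discounted return for one episode can be computed directly from the reward sequence:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute G = sum_t gamma^t * r_t over one episode."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# A full-year episode at hourly resolution has T = 8760 decision steps.
T = 8760
G = discounted_return(np.ones(T), gamma=0.99)
# For constant r_t = 1, G approaches 1 / (1 - gamma) = 100 as T grows.
```

The discount factor value 0.99 here is illustrative; it shows how rewards far in the future (e.g., end-of-year degradation effects) still contribute to the return, which is what makes long-horizon credit assignment possible.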
3.2. State Representation
The state representation is designed to capture the thermal environment, air-quality exposure, equipment condition, and temporal context relevant to air-conditioning control. In this study, the state vector consists of eight variables:
$$s_t = \left(T_{in},\; RH_{in},\; T_{out},\; RH_{out},\; PM10,\; F_{h},\; h,\; d\right),$$
where
$T_{in}$: indoor air temperature (°C)
$RH_{in}$: indoor relative humidity (%)
$T_{out}$: outdoor air temperature (°C)
$RH_{out}$: outdoor relative humidity (%)
PM10: particulate matter concentration (µg/m³)
$F_{h}$: filter health index (normalized between 0 and 1)
$h$: hour of day (0–23)
$d$: day of year (1–365)
Indoor temperature is approximated using a simplified thermal response model driven by outdoor conditions and cooling actions. As a result, indoor temperature evolves dynamically rather than being fixed or directly equal to outdoor temperature. Indoor humidity is included as a state variable but is not directly penalized in the reward function, allowing the model to focus on temperature-based comfort control. The inclusion of PM10 and filter health enables the agent to account for environmental air quality and equipment degradation. Filter health is modeled as a simplified PM10-driven degradation process, where filter health decreases proportionally to PM10 concentration. This abstraction reflects the effect of dust accumulation on airflow reduction and increased system load. All continuous variables are normalized to the range [0, 1] using min–max scaling to improve learning stability and prevent dominance of variables with larger numerical scales.
This formulation should be interpreted as an approximate Markov representation under partial observability. While the true system may violate strict Markov assumptions due to unobserved factors such as occupancy and internal heat gains, the inclusion of temporal variables and observable thermal responses provides sufficient information for stable policy learning in practice.
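The min–max scaling step described above can be sketched as follows. The variable bounds are illustrative assumptions for this sketch (the paper's exact bounds are not specified here); only the filter health index is already normalized by construction:

```python
import numpy as np

# Assumed variable ranges for illustration only.
STATE_BOUNDS = {
    "T_in":   (18.0, 35.0),    # indoor temperature, deg C
    "RH_in":  (0.0, 100.0),    # indoor relative humidity, %
    "T_out":  (15.0, 45.0),    # outdoor temperature, deg C
    "RH_out": (0.0, 100.0),    # outdoor relative humidity, %
    "PM10":   (0.0, 300.0),    # particulate concentration, ug/m3
    "F_h":    (0.0, 1.0),      # filter health index (already in [0, 1])
    "hour":   (0.0, 23.0),     # hour of day
    "day":    (1.0, 365.0),    # day of year
}

def normalize_state(raw):
    """Min-max scale each state variable to [0, 1], clipping out-of-range readings."""
    vec = []
    for key, (lo, hi) in STATE_BOUNDS.items():
        x = np.clip(raw[key], lo, hi)
        vec.append((x - lo) / (hi - lo))
    return np.array(vec, dtype=np.float32)

s = normalize_state({"T_in": 26.5, "RH_in": 55, "T_out": 33, "RH_out": 70,
                     "PM10": 120, "F_h": 0.8, "hour": 14, "day": 120})
assert s.shape == (8,) and s.min() >= 0.0 and s.max() <= 1.0
```

Clipping before scaling keeps sensor outliers from pushing normalized inputs outside [0, 1], which supports the learning-stability goal stated above.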
3.3. Discrete Action Space
To ensure compatibility with value-based deep reinforcement learning, the action space is defined as a discrete set of feasible air conditioner control commands. Each action corresponds to a composite operating configuration:
$$a_t = \left(a_{dir},\; a_{set},\; a_{fan},\; a_{mode}\right),$$
where
$a_{dir}$: airflow direction control
$a_{set}$: temperature setpoint (mapped to 18–31 °C)
$a_{fan}$: fan speed level (low, medium, high)
$a_{mode}$: operation mode (5 discrete modes)
The resulting action space is implemented as a MultiDiscrete control structure, with a total of 420 possible action combinations. This formulation corresponds to a discrete combination of control variables rather than a continuous control space. In the current implementation, temperature setpoint and fan speed directly influence thermal dynamics and energy consumption, while the remaining dimensions have limited immediate influence in the simplified environment and are retained for extensibility and future system integration. Learning efficiency is supported through prioritized experience replay and multi-step returns to mitigate potential redundancy effects.
Although some action dimensions are not strongly coupled to the current simplified dynamics, preliminary experiments indicated that the learning process remains stable without significant degradation in convergence behavior. This suggests that the adopted architecture and sampling mechanisms are sufficient to mitigate potential inefficiencies arising from action-space redundancy.
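A flat enumeration of this composite action space can be sketched as below. The per-dimension cardinalities are inferred from the stated totals: 14 setpoints (18–31 °C), 3 fan speeds, and 5 modes account for 210 combinations, so the stated total of 420 implies two airflow-direction settings (an assumption, not stated explicitly in the text):

```python
import itertools

# Inferred cardinalities consistent with the stated 420 combinations:
# 2 airflow directions x 14 setpoints x 3 fan speeds x 5 modes.
AIRFLOW   = [0, 1]                   # assumed two airflow-direction settings
SETPOINTS = list(range(18, 32))      # 18..31 deg C inclusive -> 14 values
FAN       = ["low", "medium", "high"]
MODES     = list(range(5))

ACTIONS = list(itertools.product(AIRFLOW, SETPOINTS, FAN, MODES))

def decode(index):
    """Map a flat discrete action index to a composite control command."""
    return ACTIONS[index]

assert len(ACTIONS) == 2 * 14 * 3 * 5 == 420
```

In a Gymnasium-style environment the same structure would typically be declared as `spaces.MultiDiscrete([2, 14, 3, 5])`; the flat enumeration above is equivalent for value-based methods that index Q-network outputs by action.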
3.4. Reward Scalarization for Multi-Objective Control
The HVAC control problem involves competing objectives, including reducing energy consumption, maintaining thermal comfort, and limiting maintenance-related degradation. A scalar reward function is therefore used to aggregate these objectives into a single learning signal:
$$r_t = -\left(w_{E} E_{t} + w_{C} C_{t} + w_{M} M_{t}\right),$$
where
$E_{t}$: energy consumption term
$C_{t}$: comfort penalty, defined as the absolute deviation between indoor temperature and the target setpoint
$M_{t}$: maintenance-related degradation penalty
The coefficients $w_{E}$, $w_{C}$, and $w_{M}$ define the relative importance of each objective. The comfort term is based on temperature deviation from a target setpoint, reflecting common practice in residential cooling control. Humidity is not directly included in the reward function to maintain model simplicity and interpretability. The maintenance term is modeled using a filter degradation proxy rather than an explicit physical dust accumulation model. This abstraction captures the key effect of PM10 exposure on system performance, where higher particulate concentration leads to faster degradation, reduced airflow efficiency, and increased energy demand. To ensure meaningful trade-offs, all reward components are scaled to comparable magnitudes. Multiple reward-weight configurations are evaluated to analyze sensitivity and robustness of the learned policy.
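A minimal sketch of this scalarization is given below. The weight values and term scales are illustrative, not the paper's RW1–RW4 settings, and the three terms are assumed to be pre-scaled to comparable magnitudes as described above:

```python
def reward(energy_term, t_indoor, t_target, degradation_term,
           w_e=1.0, w_c=1.0, w_m=1.0):
    """Scalarized multi-objective reward (sketch; weights are illustrative).

    Combines an energy term, a comfort penalty (absolute deviation of the
    indoor temperature from the target setpoint), and a maintenance-related
    degradation penalty into one negative learning signal.
    """
    comfort_penalty = abs(t_indoor - t_target)
    return -(w_e * energy_term + w_c * comfort_penalty + w_m * degradation_term)

# Larger deviation from the setpoint lowers the reward, all else equal.
assert reward(1.0, 25.0, 25.0, 0.1) > reward(1.0, 28.0, 25.0, 0.1)
```

Changing the weights shifts the trade-off rather than the structure: raising `w_m` relative to `w_e` makes the agent more willing to spend energy to avoid degradation-heavy operating regimes, which is what the RW1–RW4 sweep probes.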
3.5. Decision Interval and Episode Definition
The controller operates at a fixed hourly decision interval. Each episode represents a full-year operating cycle consisting of 8760 time steps. This long-horizon formulation enables the agent to learn policies that account for:
seasonal weather variability
time-varying PM10 concentration
cumulative degradation effects
The full specification of environmental inputs, simulation assumptions, and evaluation protocol is described in
Section 5. This formulation provides a practical and computationally efficient framework for learning adaptive air-conditioning control policies under realistic environmental conditions.
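The episode structure described in this section can be sketched as a standard interaction loop. The environment and agent interfaces below are hypothetical placeholders following the common Gymnasium-style convention, not the actual implementation detailed in Section 5:

```python
# Sketch of one annual training episode at hourly resolution.
# `env` and `agent` are hypothetical objects with a Gymnasium-like API.
HOURS_PER_YEAR = 365 * 24   # 8760 decision steps per episode

def run_episode(env, agent):
    state = env.reset()
    total_reward = 0.0
    for t in range(HOURS_PER_YEAR):
        action = agent.select_action(state)        # exploration-guided choice
        next_state, r, done = env.step(action)     # one hourly transition
        agent.store(state, action, r, next_state)  # replay buffer (e.g., PER)
        agent.learn()                              # one learning update
        total_reward += r
        state = next_state
        if done:
            break
    return total_reward

assert HOURS_PER_YEAR == 8760
```

This loop makes the long-horizon nature of the task concrete: a single episode spans 8760 decisions, so seasonal effects and cumulative degradation are encountered within, not across, episodes.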
6. Results
This section reports the experimental results of the proposed Enhanced Deep Q-Network (Enhanced DQN). The evaluation comprises two parts. First, LunarLander-v3 is used as a diagnostic benchmark to examine learning dynamics, convergence behavior, and stability under stochastic training conditions. Second, the main application results for split-type air conditioner energy optimization are presented, which constitute the primary contribution of this study.
6.1. Diagnostic Benchmark Evaluation on LunarLander-v3
To analyze the learning characteristics of the proposed method under controlled conditions, a comparative benchmark evaluation was conducted on the LunarLander-v3 environment. This benchmark is intentionally used as a diagnostic tool rather than the target application. The goal is to examine learning dynamics, convergence behavior, and stability under stochastic training conditions, before transferring the method to the one-year split-type air conditioning control task. Unless otherwise stated, each method was trained under the same episode budget and repeated across multiple random seeds; the learning curves report the mean episodic return across seeds with variability summarized as ±1 standard deviation. For readability, a moving-average smoothing is applied to the learning curves.
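The moving-average smoothing mentioned above can be reproduced with a simple windowed mean; the window length used for the reported curves is not specified here, so the value below is illustrative:

```python
import numpy as np

def moving_average(x, window=20):
    """Simple moving average used to smooth noisy episodic learning curves."""
    x = np.asarray(x, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

returns = np.array([0.0, 10.0, 20.0, 30.0, 40.0, 50.0])
smoothed = moving_average(returns, window=2)
# Each smoothed point is the mean of two consecutive episodic returns.
assert np.allclose(smoothed, [5.0, 15.0, 25.0, 35.0, 45.0])
```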
Classical variants commonly exhibit faster early-stage improvement in this benchmark environment, whereas the Enhanced DQN improves more gradually and maintains a larger variability band throughout training (see
Figure 3). The horizontal reference line at 200 indicates the commonly used “solved” threshold for LunarLander-v3 and is included as a benchmark reference rather than a primary optimization target. From a diagnostic perspective, the observed learning profile suggests that the Enhanced DQN emphasizes sustained exploration and robustness, which is desirable for non-stationary control settings where operating conditions and disturbances evolve over time.
Beyond mean learning curves, distributional analysis is useful to assess stability and tail-risk behaviors that may be obscured by averages. Therefore, steady-state reward variability is summarized using the final episodes of each training run, where performance is expected to reflect late-training behavior after substantial learning has occurred.
Figure 4 presents the reward distribution over the final 100 training episodes across five random seeds for the benchmark algorithms.
Baseline variants typically show more concentrated distributions, while the Enhanced DQN can exhibit a broader distribution and occasional low-reward episodes. Importantly, this pattern does not imply that the proposed method is designed for benchmark-specific return maximization. Instead, the retained dispersion and occasional tail events indicate that the proposed method preserves exploratory capacity even in late training stages. Such characteristics are particularly relevant for real-world, noisy control problems, which are not fully represented by simplified benchmark tasks such as LunarLander-v3.
Overall, the benchmark results serve as diagnostic evidence that the proposed Enhanced DQN exhibits stable long-horizon learning behavior under stochastic training, motivating its subsequent evaluation in the target split-type air conditioning control scenario presented in the following sections. This diagnostic evaluation is not intended to demonstrate benchmark superiority, but rather to assess learning stability under stochastic conditions prior to deployment in the target application.
6.2. Annual Energy Optimization Performance on Split-Type Air Conditioning (Main Result)
The baseline controller, operating at a fixed setpoint of 25 °C, consumed 5116.22 kWh over the simulated annual period. Across the evaluated reward-weight configurations, the proposed Enhanced DQN consistently reduced annual energy consumption, with the strongest performance observed under RW4, followed closely by RW3. Because the baseline controller is deterministic, it exhibits no variability across seeds, whereas the RL-based controller shows only modest cross-seed variation, which is explicitly reported to ensure transparency and reproducibility.
The relatively low standard deviation indicates stable policy performance across different random initializations, confirming the robustness of the learned control strategy in long-horizon operation.
6.3. Annual Energy Consumption and Savings
Figure 5 compares the annual energy consumption of the fixed 25 °C baseline with that of the two strongest reward-weight configurations of the proposed Enhanced DQN, namely RW3 and RW4, reported as mean ± standard deviation across five random seeds (n = 5). The baseline consumed 5116.22 kWh over the simulated year, whereas Enhanced DQN under RW3 and RW4 consumed 4440.69 ± 27.90 kWh and 4440.03 ± 37.50 kWh, respectively. These values correspond to absolute reductions of 675.54 ± 27.90 kWh for RW3 and 676.20 ± 37.50 kWh for RW4.
In percentage terms, the observed reductions correspond to annual energy savings of 13.20% for RW3 and 13.22% for RW4 relative to the deterministic baseline. The difference between RW3 and RW4 is therefore very small in practical terms, indicating that both reward-weight configurations achieve similarly strong long-horizon energy-saving performance. At the same time, RW3 exhibits slightly lower cross-seed variability than RW4, suggesting marginally greater consistency across random initializations, whereas RW4 achieves the lowest mean annual energy consumption among the evaluated configurations.
Table 4 summarizes the annual energy consumption, energy reduction, and energy-saving percentage for all evaluated reward-weight configurations relative to the fixed 25 °C baseline. RW1, RW2, RW3, and RW4 achieved mean annual consumptions of 4521.20 ± 50.10, 4456.61 ± 20.16, 4440.69 ± 27.90, and 4440.03 ± 37.50 kWh, corresponding to savings of 11.63%, 12.90%, 13.20%, and 13.22%, respectively. These results indicate that the proposed framework consistently improves annual energy performance relative to the fixed-setpoint baseline across multiple reward-weight settings.
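As a quick consistency check using only the values reported above, the savings percentages follow directly from the annual means and the fixed-setpoint baseline:

```python
# Reproducing the reported savings percentages from the Table 4 means
# relative to the fixed 25 deg C baseline (5116.22 kWh).
BASELINE_KWH = 5116.22
MEAN_KWH = {"RW1": 4521.20, "RW2": 4456.61, "RW3": 4440.69, "RW4": 4440.03}

savings_pct = {k: 100.0 * (BASELINE_KWH - v) / BASELINE_KWH
               for k, v in MEAN_KWH.items()}

# Matches the reported 11.63%, 12.90%, 13.20%, and 13.22% to rounding.
assert abs(savings_pct["RW1"] - 11.63) < 0.01
assert abs(savings_pct["RW4"] - 13.22) < 0.01
```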
Overall, the findings confirm that the proposed Enhanced DQN can deliver substantial and repeatable annual energy savings in a realistic residential control scenario without relying on hand-crafted schedules or fixed heuristics. More importantly, the close performance of RW3 and RW4 suggests that the framework is not narrowly dependent on a single reward-weight choice but rather remains effective across nearby multi-objective preference settings.
RW4 achieved the lowest mean annual energy consumption, whereas RW3 showed slightly lower cross-seed variability. Together, these results indicate that the proposed framework is robust across reward-weight configurations and that strong energy-saving performance can be achieved without a single sharply dominant setting.
6.4. Seasonal and Environmental Adaptation with PM10 Awareness
To further investigate the behavioral characteristics of the learned policy under environmental variability, the relationship between daily energy consumption and particulate matter concentration (PM10) was analyzed over the full 365-day simulation period. In this section, the analysis is presented for the RW4 configuration, which achieved the lowest mean annual energy consumption among the evaluated reward-weight settings.
For the RW4 configuration,
Figure 6 shows the 7-day moving-average profiles of daily energy consumption and PM10 concentration over the annual simulation horizon.
Figure 6 is intended to support behavioral interpretation rather than to establish a strict causal relationship between PM10 concentration and energy consumption. The purpose is to examine whether the learned control policy exhibits environmentally coherent behavior when air-quality conditions fluctuate over time. The energy trajectory under RW4 remains generally smooth over the full annual horizon, with no evidence of persistent oscillatory instability or prolonged uncontrolled spikes. Although short-term increases in energy use are observed during periods of elevated seasonal demand, the controller does not exhibit sustained excessive energy escalation during later periods of renewed PM10 increase. This suggests that the learned policy remains responsive to environmental conditions while avoiding unnecessarily aggressive operation.
Because the current environment uses a simplified PM10-driven degradation proxy rather than a detailed physical fouling model, the observed relationship should be interpreted as indirect evidence of maintenance-aware behavior rather than explicit physical validation. In this context, the relatively stable daily energy trajectory indicates that the controller is able to maintain adaptive long-horizon behavior under changing environmental conditions without sacrificing operational stability. These findings complement the annual energy results in
Section 6.2 by showing that the proposed Enhanced DQN does not merely reduce aggregate annual energy use, but also exhibits environmentally adaptive and operationally coherent behavior over time.
6.5. Reward Weight Sensitivity and Robustness Analysis
To evaluate the robustness of the proposed framework with respect to reward-weight variations, annual and monthly energy consumption was compared across four reward-weight configurations (RW1–RW4). All configurations were trained and evaluated under identical environment settings and hyperparameters, with only the reward weights varied, allowing the effect of objective preference to be isolated directly. As summarized in
Table 4, the fixed-setpoint baseline consumed 5116.22 kWh annually, whereas the Enhanced DQN achieved the following mean annual energy consumptions across five random seeds: 4521.20 ± 50.10 kWh for RW1, 4456.61 ± 20.16 kWh for RW2, 4440.69 ± 27.90 kWh for RW3, and 4440.03 ± 37.50 kWh for RW4. These correspond to annual energy savings of 11.63%, 12.90%, 13.20%, and 13.22%, respectively. While the current study evaluates a structured set of four configurations, the observed performance trends suggest that the policy behavior varies smoothly with respect to reward-weight changes rather than exhibiting abrupt instability. This indicates that the proposed framework is not highly sensitive to small perturbations in reward design within a reasonable parameter range.
These results show that all reward-weight configurations substantially improve annual energy performance relative to the baseline. The maximum deviation among the RW1–RW4 means is 81.17 kWh (4521.20 − 4440.03 kWh), which corresponds to only 1.59% of baseline annual consumption, indicating limited sensitivity to reward-weight tuning at the system level. Moreover, the reported standard deviations remain below 2% of annual consumption for all configurations, confirming stable behavior across random seeds. Among the tested settings, RW4 achieved the lowest mean annual energy consumption, although the difference between RW3 and RW4 is very small in practical terms. At the same time, RW3 exhibited slightly lower cross-seed variability than RW4, suggesting marginally greater consistency across random initializations. Taken together, these findings indicate that the proposed framework is robust rather than narrowly dependent on a single sharply tuned reward design.
Figure 7 compares the monthly energy consumption profiles of RW1–RW4 with the fixed 25 °C baseline over the annual simulation cycle.
Figure 7 further supports this conclusion by showing that all reward-weight configurations consistently reduce monthly energy consumption relative to the fixed 25 °C baseline across the full year. Importantly, the seasonal demand profile remains preserved under all configurations, with higher consumption during peak cooling months and lower demand during milder periods. This indicates that the learned policies do not distort the natural cooling-demand structure of the environment. RW1 shows the highest monthly energy use among the RL configurations, especially during higher-demand periods, whereas RW2, RW3, and RW4 remain closely clustered and consistently below the baseline. RW3 and RW4 provide the strongest overall performance, with only minor month-to-month differences between them, further indicating that the observed annual gains are not the result of narrow reward over-specialization.
Overall, the reward-weight sensitivity analysis shows that the proposed Enhanced DQN maintains stable long-horizon control performance across multiple objective preferences. Reward-weight variation primarily influences the relative emphasis of the learned behavior rather than causing policy collapse, oscillatory instability, or seasonal divergence. This practical tunability is important for real-world deployment, where different users or operating contexts may prioritize energy efficiency, comfort, and maintenance-related considerations differently, while still requiring a robust and reliable control policy.
7. Discussion
This study investigated the applicability of a deep reinforcement learning-based control framework for improving the energy efficiency and operational stability of residential split-type air conditioning systems. The proposed Enhanced Deep Q-Network (Enhanced DQN) was designed to address the long-horizon and non-stationary characteristics of residential cooling environments, including seasonal weather variability and particulate pollution. Rather than focusing on benchmark optimization alone, the approach emphasizes robustness and adaptability in real operational contexts, which is increasingly important in smart building energy management and intelligent HVAC control systems.
7.1. Algorithmic Dynamics and Diagnostic Insights
The diagnostic benchmark experiment conducted on LunarLander-v3 provides useful insight into the learning behavior of the proposed framework. In comparison with conventional value-based algorithms, including DQN and Dueling DQN, the Enhanced DQN exhibited a more gradual convergence pattern and maintained a wider reward distribution throughout training. This behavior is consistent with reinforcement learning theory, in which exploration-preserving mechanisms may reduce the rate of early convergence but improve robustness under non-stationary conditions [
14,
16]. Several architectural components contribute to this characteristic. Double Q-learning reduces overestimation bias in action-value estimation [
17], while the dueling architecture improves representation learning by separating state-value and advantage estimation [
18]. In addition, prioritized experience replay emphasizes informative transitions and accelerates learning from high-temporal-difference-error samples [
19]. When combined with entropy-aware exploration, these mechanisms promote continued policy adaptation rather than premature convergence to a potentially brittle deterministic policy. Such properties are particularly important in building control applications, where environmental disturbances and operating conditions may vary over time.
7.2. Application-Level Energy Performance and SDG 7 Relevance
The principal contribution of this research lies in the application-level evaluation using a full-year simulation of a residential air-conditioning system. Across the evaluated reward-weight configurations, the proposed Enhanced DQN consistently reduced annual electricity consumption relative to the conventional fixed 25 °C baseline. The strongest results were obtained under RW3 and RW4, which achieved annual energy savings of 13.20% and 13.22%, respectively. Although RW4 produced the lowest mean annual energy consumption, the difference between RW3 and RW4 was very small, while RW3 exhibited slightly lower cross-seed variability. These results demonstrate that reinforcement learning can identify control policies that adapt to seasonal cooling demand while maintaining stable long-horizon performance.
These findings align with broader research on reinforcement learning for building energy management, where data-driven control strategies have been shown to outperform rule-based or schedule-based methods in HVAC optimization [
22,
23]. In tropical climates such as Thailand, where residential cooling demand constitutes a substantial share of household electricity consumption, improvements in air-conditioner operation can have meaningful impacts on national energy consumption and emissions [
4]. Consequently, intelligent control strategies such as the proposed approach contribute to broader sustainability objectives, particularly those associated with Sustainable Development Goal 7 (Affordable and Clean Energy).
7.3. Maintenance-Aware Behavior Under PM10 Conditions
Beyond energy efficiency, the analysis also revealed environmentally adaptive behavior in relation to particulate matter exposure. During periods of elevated PM10 concentration, the RL-controlled system did not exhibit abrupt increases in energy demand or unstable operational patterns. Instead, the policy maintained relatively stable energy consumption while responding to environmental variability. Although the present simulation represents filter degradation only through a simplified PM10-driven proxy rather than detailed physical wear dynamics, this pattern suggests a form of implicit maintenance-aware control. Previous studies suggest that particulate accumulation on air-conditioning filters and heat exchanger surfaces can degrade system efficiency, reduce heat-transfer performance, and increase maintenance requirements over time [
9,
10]. By avoiding excessive high-load operation during periods of poor air quality, the learned policy may indirectly reduce maintenance burden and extend equipment life. The integration of operational control with maintenance considerations is increasingly discussed in the literature on predictive and prescriptive maintenance systems [
31,
32]. However, most existing frameworks treat maintenance planning and control optimization as separate layers. The present study contributes to bridging this gap by demonstrating that reinforcement learning-based control policies can implicitly incorporate maintenance-relevant signals into operational decisions.
7.4. Reward-Weight Sensitivity, Policy Tunability, and SDG 13 Relevance
Another important finding concerns the robustness of the control framework to reward-weight variations. The reward-sensitivity experiment showed that annual energy savings remained relatively consistent across multiple reward configurations (RW1–RW4). Among the evaluated settings, RW4 produced the lowest mean annual energy consumption, while RW3 showed slightly lower variability across seeds. However, the difference between RW3 and RW4 remained very small, indicating that the system is not overly sensitive to specific reward-weight choices. Reward shaping plays a crucial role in reinforcement learning applications because it determines how competing objectives are balanced [
14]. In building energy management, these objectives typically include energy consumption, thermal comfort, equipment stability, and economic cost [
22,
24]. The observed stability across reward configurations suggests that the proposed architecture learns a generalizable policy structure rather than exploiting narrow reward incentives.
From a sustainability perspective, this tunability is also relevant to SDG 13, which emphasizes climate action through adaptive and efficient resource use. A control framework that maintains stable energy-saving performance under different objective preferences is better suited to practical deployment in diverse environmental and operational conditions. This characteristic increases the potential of the proposed method as a flexible decision-support mechanism for intelligent and sustainable air-conditioning control.
7.5. Practical Deployment Challenges and Limitations
Despite the encouraging results, several practical issues remain for real-world deployment. First, implementation in residential environments would require additional IoT sensing infrastructure, including PM, temperature, and humidity sensors, together with a microcontroller and communication interface for continuous monitoring. A practical sensing node can be implemented using commercially available components such as a PMS7003 particulate matter sensor, an SHT30/SHT31 temperature-humidity sensor, and an ESP32-based controller. Based on currently available component prices in Thailand, such a sensing node would cost approximately 37–61 USD per installation point, excluding installation, calibration, and long-term maintenance costs. PMS7003 modules are commonly sold in the range of about 25–37 USD, SHT31 sensors around 4–6 USD, and ESP32 development boards around 1–7 USD, depending on vendor and board type.
Second, the present framework evaluates comfort mainly through environmental variables and a temperature-based reward formulation, without incorporating direct occupant comfort feedback. In practice, perceived comfort may differ across users and may also be influenced by activity level, clothing, occupancy patterns, and personal preference. As a result, the current formulation should be interpreted as a simplified comfort-oriented control model rather than a fully personalized comfort management system.
Third, the study relies on a simulation-based environment representing a residential air-conditioning system. Although the simulation incorporates realistic weather and PM10 data, it does not fully capture all physical dynamics of real buildings, such as occupant behavior, internal heat gains, and detailed thermal inertia. In particular, indoor thermal response is approximated using a simplified model driven by outdoor conditions and cooling actions, rather than a full physics-based building simulation. In addition, although the proposed Enhanced DQN demonstrates stable performance in offline training, practical real-time deployment would require reliable sensing, stable communication, robust data preprocessing, and sufficiently efficient inference on embedded or edge computing hardware. These requirements introduce additional implementation complexity beyond the simulation setting considered in this study.
In addition, the current framework does not incorporate dynamic or time-varying electricity pricing. The reward structure is therefore based on static energy-related penalties rather than real tariff fluctuations, which limits the economic realism of the present formulation.
7.6. Future Work Direction
Future work will focus on extending the framework toward real-world validation, including hardware-in-the-loop experiments and deployment in actual residential environments. In addition, further investigation of multi-objective optimization with dynamic electricity pricing and user feedback integration will enhance the practical applicability of the proposed approach.
In summary, this study demonstrates that reinforcement learning-based control can effectively address the challenges of long-horizon, non-stationary residential cooling systems. The proposed Enhanced DQN achieves significant energy savings while maintaining stable and adaptive behavior under varying environmental conditions. Importantly, the framework exhibits robustness across reward configurations without policy instability, indicating strong generalizability. These findings highlight the potential of reinforcement learning as a practical and scalable solution for intelligent energy management in residential buildings, contributing to sustainable and data-driven HVAC control systems.
8. Conclusions
This study developed an Enhanced Deep Q-Network (Enhanced DQN) framework for long-horizon energy optimization and maintenance-aware control of residential split-type air conditioners. The proposed approach integrates multiple stability- and exploration-oriented components, including Double Q-learning, a dueling architecture, prioritized experience replay, multi-step returns, Bayesian regularization, and entropy-guided action selection, to improve robustness under non-stationary operating conditions and extended control horizons. Rather than focusing on benchmark dominance, LunarLander-v3 was used as a diagnostic environment to examine learning dynamics, cross-seed variability, and exploration behavior. The results indicate that the proposed framework maintains sustained exploration and adaptive learning behavior under stochastic conditions. These results suggest that the proposed framework is computationally feasible for offline training, although further optimization is required for real-time deployment in embedded or edge environments.
At the application level, evaluation in a 365-day hourly simulation demonstrated that the Enhanced DQN consistently reduced annual electricity consumption relative to the fixed 25 °C baseline. Across the evaluated reward-weight configurations, the strongest performance was obtained under RW3 and RW4, which achieved annual energy savings of 13.20% and 13.22%, respectively. RW4 yielded the lowest mean annual energy consumption, whereas RW3 showed slightly lower cross-seed variability. These results confirm that the learned policy can effectively adapt to seasonal demand variations while maintaining stable long-horizon performance.
Seasonal and environmental analyses further show that the learned policy avoids excessive energy spikes under high PM10 conditions, indicating environmentally adaptive and operationally stable behavior. Although maintenance effects are inferred indirectly through environmental signals rather than explicitly modeled, the observed behavior is consistent with maintenance-aware operation that may reduce mechanical stress over time.
Reward-weight sensitivity analysis further demonstrated controlled tunability. Across configurations (RW1–RW4), the maximum deviation in annual energy consumption remained small relative to baseline demand, indicating relatively low sensitivity to reward-weight variation under the evaluated conditions. The close performance of RW3 and RW4 suggests that the proposed framework does not rely on a narrowly optimized reward setting but can maintain strong energy-saving performance across nearby multi-objective preference structures.
Despite these promising results, several limitations should be noted. The evaluation is based on a simulation environment and does not fully capture real-world complexities such as occupant behavior, internal heat gains, and detailed physical system dynamics. In addition, maintenance effects are represented only through a simplified degradation proxy rather than detailed physical degradation dynamics, although the main reported experiments were evaluated across five random seeds to improve statistical robustness and reproducibility. Furthermore, real-world deployment would require integration with IoT sensing infrastructure and efficient real-time control implementation.
Future work will focus on real-world validation, including hardware-in-the-loop testing and deployment in residential environments. Extensions of the framework to include explicit degradation modeling, adaptive reward-weight scheduling, dynamic electricity pricing, and multi-building or multi-agent coordination will further enhance its applicability and scalability. Overall, the findings demonstrate that reinforcement learning-based control offers a promising and practical solution for intelligent, adaptive, and energy-efficient HVAC operation in residential buildings, and these results highlight the practical potential of reinforcement learning as a scalable and adaptive control paradigm for next-generation residential energy systems.