Energy-Efficient Access Point Switch On/Off in Cell-Free Massive MIMO Using Proximal Policy Optimization

García-Barrios, Guillermo; Alonso, Alberto; Fuentes, Manuel

doi:10.3390/electronics15061219


by Guillermo García-Barrios ¹,*, Alberto Alonso ² and Manuel Fuentes ¹

¹ 5G Communications for Future Industry Verticals S.L. (Fivecomm), Camí de Vera s/n (6D Building), 46022 Valencia, Spain

² International Business Machines Corporation (IBM), 28020 Madrid, Spain

* Author to whom correspondence should be addressed.

Electronics 2026, 15(6), 1219; https://doi.org/10.3390/electronics15061219

Submission received: 14 February 2026 / Revised: 9 March 2026 / Accepted: 12 March 2026 / Published: 14 March 2026

(This article belongs to the Special Issue AI Techniques for Integrated Sensing and Communication in Future Networks)


Abstract

The increasing densification of cell-free massive multiple-input multiple-output (MIMO) networks makes access point (AP) switch on/off (ASO) a key mechanism for improving energy efficiency in future wireless systems. While reinforcement learning (RL) has been explored for ASO, differences in modeling assumptions and evaluation scope leave open questions regarding robustness and scalability. In this work, ASO is investigated from an explicit energy-efficiency perspective using an RL framework based on Proximal Policy Optimization (PPO). The policy learns state-dependent AP activation under partial observability using compact per-AP large-scale fading statistics and power parameters, without requiring instantaneous small-scale channel state information or combinatorial search, enabling practical online implementation. A comprehensive evaluation is conducted under a unified and reproducible simulation framework across three cell-free deployment scenarios of increasing size that preserve AP density while incorporating realistic channel and power consumption models. Performance is assessed through both average and distribution-based metrics. Numerical results show that the PPO-based policy consistently outperforms random activation and the all-on baseline, achieving energy-efficiency improvements of up to 66% and nearly 50%, respectively, while activating a comparable number of APs. Moreover, the learned policy maintains robust performance as the network scales, reducing the likelihood of highly energy-inefficient operating regimes.

Keywords: cell-free massive multiple-input multiple-output; access point switch on/off; energy efficiency; reinforcement learning; proximal policy optimization; green wireless communications; network energy optimization; wireless networks

1. Introduction

The deployment of fifth generation (5G) wireless networks is currently progressing worldwide. At the same time, the evolution towards beyond-fifth generation (B5G) and sixth generation (6G) systems is already shaping the design principles of future wireless communications. These emerging networks are expected to support extremely high data rates, ultra-low latency, and massive connectivity, while simultaneously meeting stringent requirements on energy efficiency and environmental sustainability. As a result, reducing the energy footprint of dense wireless infrastructures has become a central challenge in the design of next-generation networks.

In this context, cell-free massive multiple-input multiple-output (MIMO) has emerged as a promising network architecture for future wireless systems. By relying on a large number of distributed access points (APs) that cooperatively serve all users without predefined cell boundaries, cell-free massive MIMO can significantly improve coverage uniformity, mitigate inter-cell interference, and enhance spectral efficiency compared to conventional cellular deployments [1,2,3]. These features make cell-free architectures attractive candidates for meeting the performance demands of future wireless networks.

However, the large-scale and distributed nature of cell-free massive MIMO systems also raises important practical concerns. In particular, the dense deployment of APs leads to a substantial increase in aggregate power consumption, which directly impacts operational costs and network sustainability. Importantly, it is neither necessary nor energy-efficient to keep all APs active at all times, especially under moderate traffic conditions or heterogeneous user distributions. This observation motivates the design of access point switch on/off (ASO) strategies, where only a subset of APs is activated in order to reduce energy consumption while maintaining acceptable quality of service (QoS).

Designing effective ASO policies in cell-free massive MIMO is a challenging task. The selection of the active AP subset is inherently combinatorial, with a search space that grows exponentially with the number of deployed APs. Moreover, ASO decisions are closely coupled with channel conditions, user locations, and power allocation strategies, which further complicates the optimization process. These characteristics make the ASO problem particularly challenging in large-scale and dynamic network scenarios, and call for solutions that are both energy-efficient and computationally tractable.

1.1. Problem Statement

We address the ASO problem from an energy efficiency (EE) perspective, where EE is defined as the ratio between achievable throughput and total power consumption. Unlike pure power minimization or spectral efficiency maximization, EE optimization explicitly captures the trade-off between achieved throughput and total energy expenditure.

Formally, the ASO problem consists of selecting a subset of active APs from a potentially large pool of candidates. This task is inherently combinatorial, since the number of possible activation patterns grows exponentially with the number of APs. In addition, the impact of activating or deactivating a given AP depends on multiple interacting factors, including channel conditions, user locations, and power allocation strategies. Consequently, identifying energy-efficient activation patterns becomes increasingly challenging as the network size grows.

Traditional heuristic and optimization-based approaches can provide valuable insights into the structure of energy-efficient activation patterns. However, their applicability to large-scale cell-free scenarios may be limited by requirements on global network information, static modeling assumptions, or increasing computational complexity. These limitations are further exacerbated when ASO decisions must be evaluated across multiple network realizations to account for the inherent variability of wireless environments.

Consequently, there is a need for scalable and computationally efficient approaches that can approximate energy-efficient AP activation strategies in large cell-free massive MIMO systems, while enabling fair and transparent comparisons against fundamental baseline methods under a unified and reproducible evaluation framework. This problem setting motivates the investigation of learning-based techniques, and in particular reinforcement learning (RL), as practical tools to address the ASO problem from an energy efficiency standpoint.

1.2. Contributions

Recent works on ASO in cell-free massive MIMO have explored both heuristic and learning-based solutions under a variety of modeling assumptions and performance objectives. While these studies provide valuable insights, differences in problem formulations, power consumption models, baseline choices, and evaluation scales make it challenging to draw consistent conclusions regarding energy efficiency, scalability, and robustness across approaches. In particular, many learning-based solutions embed ASO within broader optimization problems or evaluate performance under limited-scale scenarios, which complicates a transparent assessment of how learned policies behave as the network size and combinatorial complexity increase.

Motivated by these observations, this paper investigates ASO in cell-free massive MIMO from an explicit energy-efficiency perspective and adopts an RL framework based on proximal policy optimization (PPO). Rather than introducing a new learning algorithm, the contribution of this work lies in providing a methodologically clean, unified, and reproducible evaluation of PPO-based ASO under realistic modeling assumptions. The main contributions of this work are summarized as follows:

We formulate the ASO problem explicitly as the selection of an energy-efficient subset of APs in cell-free massive MIMO systems and address it through a scalable RL framework operating under partial observability. The proposed formulation avoids exhaustive combinatorial search and does not require full large-scale fading knowledge, making it suitable for practical online implementation.
We develop a PPO-based ASO policy that directly optimizes network energy efficiency and benchmark it against well-defined baseline strategies, including random activation, the all-on configuration, and a greedy oracle upper bound. This comparison framework makes explicit the underlying information assumptions and enables a transparent interpretation of the learned policy’s performance.
We conduct a systematic performance evaluation across multiple deployment scenarios of increasing scale, designed to preserve AP density while increasing network size. This evaluation framework allows the scalability and robustness of the learning-based policy to be assessed under controlled and reproducible conditions.
Beyond average performance metrics, we provide a detailed distributional analysis based on boxplots, cumulative distribution functions (CDFs), and EE conditioned on the number of active APs. This analysis characterizes not only typical performance but also robustness and reliability, revealing how the learned policy reduces the likelihood of highly energy-inefficient operating regimes.

1.3. Structure

The remainder of this paper is organized as follows. Section 2 reviews the existing literature on access point switch on/off strategies in cell-free massive MIMO systems, with particular emphasis on energy-efficiency-oriented approaches and reinforcement-learning-based solutions. Section 3 presents the system model, including the channel model, channel estimation procedure, downlink transmission framework, and the adopted power consumption and energy efficiency formulations. The ASO problem is formally defined in Section 4, together with the baseline strategies used for benchmarking purposes. Section 5 introduces the proposed reinforcement learning framework, detailing the state, action, and reward definitions, as well as the PPO-based solution and training procedure. Section 6 describes the simulation setup and the comparison framework adopted for performance evaluation. Numerical results and discussion are presented in Section 7, including baseline analysis, training behavior, and a detailed performance evaluation across the considered scenarios. Finally, Section 8 concludes the paper and outlines directions for future research.

1.4. Reproducible Research

Results presented in this paper can be reproduced using the dataset and code provided at: https://github.com/Fivecomm/cell-free-aso-ppo.git (accessed on 13 February 2026).

2. Related Work

ASO has been widely investigated as a mechanism to improve the energy efficiency of dense wireless networks, including cell-free massive MIMO systems [4]. In these architectures, a large number of distributed APs cooperatively serve users, enabling selective activation without necessarily compromising quality of service.

Early works primarily relied on heuristic or optimization-based strategies. For instance, ref. [5] demonstrated that selective AP deactivation can significantly enhance energy efficiency compared to always-on deployments, although the proposed methods require global information and scale poorly with network size. Optimization-based approaches such as [6] formulated joint AP selection and power allocation using sparsity-promoting techniques, achieving notable gains at the cost of increased computational complexity. Other works considered structured activation rules based on effective channel gains [7], user-centric clustering [8], or joint beamforming and AP switching in mmWave systems [9]. While these studies confirm the potential of ASO, they often rely on simplified power models or limited-scale evaluations.

More recent optimization-based studies have further investigated energy-efficient AP selection under realistic network modeling assumptions. For example, ref. [10] proposed a clustered cell-free architecture with a ratio-fixed AP selection mechanism and derived a closed-form expression for the optimal AP-per-user ratio that maximizes average network energy efficiency under an explicit backhaul-aware power model. Similarly, system-level analyses such as [11] examined scalable cell-free architectures with detailed hardware-based power consumption models, highlighting the interplay between processing splits, fronthaul constraints, and energy efficiency. While these works provide valuable analytical insights into deployment density and clustering strategies, they focus on average or planning-level optimization rather than per-realization ASO decisions under partial observability.

More recently, RL techniques have been explored to address the combinatorial nature of the ASO problem. Early deep reinforcement learning (DRL)-based approaches employed deep Q-network (DQN) or deep deterministic policy gradient (DDPG) architectures to control AP activation, often targeting QoS satisfaction, delay minimization, or total power reduction rather than direct energy efficiency maximization [12,13,14]. Policy-gradient methods such as PPO and advantage actor-critic (A2C) have also been adopted to control AP sleep modes or jointly optimize activation and power allocation [15,16,17]. More advanced DRL formulations have also considered hierarchical control structures. In particular, ref. [18] proposed a hierarchical DDPG-based framework that jointly optimizes AP clustering and power allocation over multiple time scales to maximize system-level energy efficiency under QoS constraints. However, existing RL-based works frequently rely on simplified or incomplete power consumption models [12,15,19], evaluate relatively small-scale scenarios [12,14,17], or compare against limited baseline strategies, typically random activation or internal algorithmic variants.

To facilitate a structured comparison of existing approaches, Table 1 summarizes the main characteristics of representative ASO-related works in cell-free massive MIMO systems, with a particular focus on RL-based solutions. This comparison provides a concise overview of the methodological gaps identified in existing works. The table highlights differences in terms of optimized metrics, power consumption modeling, evaluation scale, and baseline strategies. The evaluation scale is categorized as small-, moderate-, or large-scale based on the number of APs considered in each study: small-scale evaluations typically involve fewer than 20 APs, moderate-scale scenarios consider on the order of 40–60 APs, and large-scale evaluations exceed approximately 60 APs. For reference, the scenarios considered in this work range from 32 to 96 APs while preserving AP density.

Despite recent progress in both optimization-based and learning-based formulations, the extent to which RL can approximate energy-efficient AP activation policies under realistic power consumption modeling and increasing network scale remains insufficiently explored in a unified and methodologically controlled setting. In particular, existing works focus on joint clustering and power control, on average network-level optimization, or on simplified power models, making it difficult to isolate the scalability and robustness of standalone ASO policies under partial observability. Moreover, systematic evaluations across increasing deployment scales while preserving comparable infrastructure density remain limited. These limitations motivate the present study, which investigates ASO as a standalone energy-efficiency maximization problem within a unified and reproducible evaluation framework.

3. System Model and Energy Efficiency Formulation

This work considers a distributed cell-free massive MIMO architecture deployed over a square geographical area. The network consists of K single-antenna user equipments (UEs) and L APs, each equipped with N antennas. The APs operate cooperatively without predefined cell boundaries and jointly serve all users within the coverage area.

This section presents the system model—including the physical-layer signal model and the adopted power consumption framework—which jointly define the foundation for the ASO problem addressed in this paper. The focus is on a scalable and widely adopted baseline that enables transparent and reproducible evaluation of energy-efficient AP activation strategies, rather than optimizing specific physical-layer transmission techniques. Energy efficiency is defined as the ratio between achievable downlink throughput and total network power consumption, explicitly accounting for AP active/sleep modes, fronthaul, and UE power under downlink operation.

3.1. Channel Model

The wireless propagation environment is modeled using correlated Rayleigh fading, following a widely adopted approach in recent studies on cell-free massive MIMO systems [23,24,25]. This model provides a representative description of non-line-of-sight (NLoS) propagation while preserving analytical tractability and scalability, which are essential for large-scale system evaluations.

The channel between AP l and UE k is represented by the random vector

$$h_{kl} \sim \mathcal{N}_{\mathbb{C}}(0_{N}, R_{kl}), \tag{1}$$

where $R_{kl} \in \mathbb{C}^{N \times N}$ denotes the spatial correlation matrix. This matrix captures both large-scale and small-scale propagation effects, including path loss, spatially correlated log-normal shadow fading, and antenna element correlation. Explicit formulations of $R_{kl}$ are adopted from [1] without modification. Specifically, the model assumes a three-dimensional urban macro (UMa) deployment with NLoS propagation, as described in [26] (Section 1.3.1).

The spatial correlation matrix depends on the array geometry and the angular distribution of the multipath components. In line with [1], each AP is equipped with a uniform linear array (ULA) with half-wavelength antenna spacing. Under the far-field assumption, the $(m, \ell)$-th element of a generic spatial correlation matrix $R_{kl}$ can be expressed as

$$[R_{kl}]_{m\ell} = \beta_{kl} \iint e^{j \pi (m - \ell) \sin(\bar{\varphi}) \cos(\bar{\theta})} f(\bar{\varphi}, \bar{\theta}) \, d\bar{\varphi} \, d\bar{\theta}, \tag{2}$$

where $\beta_{kl}$ denotes the large-scale fading coefficient between AP l and UE k, $\bar{\varphi}$ and $\bar{\theta}$ represent the azimuth and elevation angles of the multipath components measured with respect to the array broadside, and $f(\bar{\varphi}, \bar{\theta})$ is their joint probability density function. The double integral in (2) corresponds to the expectation

$$\mathbb{E}\left\{ e^{j \pi (m - \ell) \sin(\bar{\varphi}) \cos(\bar{\theta})} \right\}, \tag{3}$$

computed over the angular distribution of the multipath components. In the considered UMa NLoS scenario, the angular spreads and pathloss model follow the standard configuration reported in [1,26].
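As a rough numerical illustration of (2) and (3), the expectation can be approximated by averaging over sampled angles. The sketch below is not the paper's implementation: it assumes independent Gaussian angular deviations around nominal angles, and the function name `ula_correlation` and all numeric values are hypothetical.

```python
import numpy as np

def ula_correlation(beta, n_antennas, phi0, theta0, asd_phi, asd_theta,
                    n_samples=20000, seed=0):
    """Monte Carlo estimate of R_kl for a half-wavelength ULA, following (2)-(3).

    Assumption (not from the paper): azimuth and elevation are independent
    Gaussians centred on the nominal angles (phi0, theta0), all in radians.
    """
    rng = np.random.default_rng(seed)
    phi = phi0 + asd_phi * rng.standard_normal(n_samples)
    theta = theta0 + asd_theta * rng.standard_normal(n_samples)
    # Spatial frequency seen by the array for every sampled path.
    omega = np.pi * np.sin(phi) * np.cos(theta)
    # Array response a_m = exp(j * m * omega). The sample mean of a a^H has
    # entries mean(exp(j * (m - l) * omega)), i.e. the expectation in (3).
    m = np.arange(n_antennas)[:, None]
    a = np.exp(1j * m * omega[None, :])            # shape (N, n_samples)
    return beta * (a @ a.conj().T) / n_samples

# Hypothetical values: beta = 1e-9 (about -90 dB), 4 antennas, 15/5 degree spreads.
R = ula_correlation(beta=1e-9, n_antennas=4,
                    phi0=np.deg2rad(30), theta0=np.deg2rad(-10),
                    asd_phi=np.deg2rad(15), asd_theta=np.deg2rad(5))
```

By construction the estimate is Hermitian and positive semidefinite, and every diagonal entry equals the large-scale fading coefficient, which matches setting m = ℓ in (2).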

The choice of this channel model is motivated by its extensive use in the cell-free massive MIMO literature and its ability to enable fair and reproducible comparisons across different AP activation strategies, rather than by the pursuit of physical-layer optimality.

3.2. Channel Estimation

The system operates under a time-division duplex (TDD) protocol, which is commonly adopted in cell-free massive MIMO systems due to channel reciprocity and its ability to acquire downlink channel state information (CSI) from uplink pilots without additional feedback overhead [23,27,28,29].

Pilot transmission and allocation follow a standard user-centric approach based on the master AP strategy described in [1]. This strategy assigns each user a serving AP responsible for pilot coordination, enabling scalable pilot reuse while limiting signaling overhead. Since pilot assignment is not the focus of this work, this mechanism is adopted as a fixed baseline throughout the simulations.

Uplink channel estimation is performed using the minimum mean-squared error (MMSE) estimator, which provides optimal estimation performance when the channel statistics are known [3]. Given the slow temporal variation of large-scale fading coefficients, it is assumed that these parameters remain constant over multiple coherence intervals and are available at the central processing unit (CPU). The resulting channel estimates are then used for downlink precoding and performance evaluation.
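For readers who want to see the estimator in concrete form, the following sketch uses the standard MMSE expression from the cell-free literature, in which the de-spread pilot signal is weighted by the channel correlation matrix and the inverse of the pilot-signal covariance. This section of the paper does not spell the expression out, so the function name `mmse_estimate`, the exponential correlation model, and all numeric values are assumptions made for illustration.

```python
import numpy as np

def mmse_estimate(y_pilot, R_k, R_sharing, tau_p, p_k, p_sharing, noise_var):
    """MMSE estimate of h_kl from the de-spread pilot signal at one AP.

    R_sharing and p_sharing list the correlation matrices and pilot powers of
    every UE that uses the same pilot as UE k (UE k itself included).
    """
    n = R_k.shape[0]
    # Covariance of the de-spread pilot signal: noise plus all co-pilot UEs.
    psi = noise_var * np.eye(n, dtype=complex)
    for R_i, p_i in zip(R_sharing, p_sharing):
        psi += tau_p * p_i * R_i
    return np.sqrt(tau_p * p_k) * R_k @ np.linalg.solve(psi, y_pilot)

# Toy check: a single UE on the pilot and a hypothetical correlation model.
rng = np.random.default_rng(0)
N, tau_p, p_k, noise_var = 4, 10, 0.1, 1e-3
idx = np.arange(N)
R = 0.5 ** np.abs(idx[:, None] - idx[None, :])     # assumed R_kl, not the UMa model
g = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
h = np.linalg.cholesky(R) @ g                      # one draw of h ~ CN(0, R)
noise = np.sqrt(noise_var / 2) * (rng.standard_normal(N)
                                  + 1j * rng.standard_normal(N))
y = np.sqrt(tau_p * p_k) * h + noise
h_hat = mmse_estimate(y, R, [R], tau_p, p_k, [p_k], noise_var)
```

Only second-order statistics enter the estimator, which is consistent with the assumption above that large-scale fading parameters vary slowly and are known.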

The adopted estimation framework is selected for its widespread use in the literature and its favorable trade-off between accuracy and computational complexity, ensuring that the subsequent ASO analysis is not biased by non-standard channel acquisition procedures.

3.3. Downlink Data Transmission

During the downlink data transmission phase, the active APs simultaneously transmit data symbols to the users using linear precoding. The received signal at UE k can be expressed as

$$y_{k}^{dl} = \sum_{l=1}^{L} h_{kl}^{H} \left( \sum_{i=1}^{K} w_{il} \varsigma_{i} \right) + n_{k}, \tag{4}$$

where $w_{il} \in \mathbb{C}^{N}$ denotes the precoding vector applied by AP l to the unit-power data symbol $\varsigma_{i}$ intended for UE i, and $n_{k} \sim \mathcal{N}_{\mathbb{C}}(0, \sigma_{dl}^{2})$ represents additive receiver noise.
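The structure of (4) can be made concrete with a small simulation in which sleeping APs simply drop out of the outer sum. This is only a sketch under simplifying assumptions that are not the paper's setup: channels are i.i.d. Rayleigh, normalized maximum-ratio precoders stand in for the actual precoding scheme, no power allocation is applied, and all sizes and the noise variance are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
L, K, N = 6, 3, 4                 # APs, UEs, antennas per AP (toy sizes)
noise_var = 1e-3                  # hypothetical sigma_dl^2

def crandn(*shape):
    """Circularly symmetric complex Gaussian samples with unit variance."""
    return (rng.standard_normal(shape)
            + 1j * rng.standard_normal(shape)) / np.sqrt(2)

h = crandn(K, L, N)                                  # h[k, l] plays the role of h_kl
w = h / np.linalg.norm(h, axis=2, keepdims=True)     # stand-in precoders w[i, l]
s = np.exp(2j * np.pi * rng.random(K))               # unit-power data symbols
n = np.sqrt(noise_var) * crandn(K)                   # receiver noise n_k

def received(active):
    """Evaluate (4) with the sum over l restricted to the active APs."""
    tx = np.einsum("iln,i->ln", w, s)                # AP l transmits sum_i w_il * s_i
    per_ap = np.einsum("kln,ln->kl", h.conj(), tx)   # h_kl^H times the AP l signal
    return per_ap[:, active].sum(axis=1) + n

y_all = received(np.ones(L, dtype=bool))             # every AP on
y_half = received(np.arange(L) < L // 2)             # only the first half on
```

With every AP asleep the function returns the noise alone, which is a quick sanity check on the indexing.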

For downlink precoding, this work adopts large-scale fading decoding combined with local partial minimum mean-squared error (LP-MMSE) processing. LP-MMSE is a scalable approximation of centralized MMSE precoding that relies primarily on large-scale fading information and local channel estimates, significantly reducing fronthaul signaling and computational complexity [30,31]. This makes LP-MMSE particularly suitable for large-scale and distributed cell-free deployments. The specific choice of precoding strategy is not central to the ASO problem addressed in this paper and is therefore adopted as a representative and widely used baseline.

This work focuses exclusively on downlink operation for three main reasons. First, ASO decisions primarily impact the downlink energy budget, where transmit power, RF circuitry, and fronthaul traffic dominate the overall network consumption, whereas uplink power is largely constrained by UE-side limitations and is less directly controllable by the network. Second, in practical cell-free deployments operating under TDD, uplink and downlink transmissions share the same large-scale propagation characteristics (pathloss and shadowing) due to channel reciprocity. Since ASO decisions are mainly driven by these large-scale fading statistics together with infrastructure power consumption parameters, the structural factors governing AP activation remain largely consistent across transmission directions. Third, restricting the analysis to downlink operation simplifies the problem formulation and enables a focused and rigorous investigation of the combinatorial ASO problem and the scalability of the PPO-based policy, without introducing additional variables related to uplink scheduling and power control.

3.4. Network Scalability

Scalability is a fundamental design requirement in cell-free massive MIMO systems, particularly in dense deployments where the number of APs and users can be large. A system is considered scalable if the computational complexity and signaling overhead per AP remain bounded as the network size increases [1].

In the considered architecture, each AP performs local signal processing tasks, including channel estimation, precoding, and power control, using locally available information and limited statistical knowledge. This distributed processing paradigm avoids excessive fronthaul signaling and prevents the computational burden from growing unbounded at any centralized entity.

The estimation and precoding techniques adopted in this work adhere to this scalability principle, ensuring that the per-AP processing requirements do not depend on the total number of users or APs in the network. This property is particularly relevant for the analysis of ASO strategies, since ASO decisions must remain computationally feasible as the network density increases.

By relying on a scalable physical-layer baseline, the proposed system model enables a fair and realistic evaluation of energy-efficient AP activation policies in large cell-free massive MIMO deployments, without introducing additional complexity that could obscure the impact of ASO decisions.

3.5. Power Consumption Model

Following the downlink power consumption framework in [5], the total power consumption during downlink payload transmission is modeled as

$$P_{T}^{DL} = P_{T,fix}^{DL} + B \sum_{k=1}^{K} \xi_{UE} \, \mathrm{SE}_{k}^{DL} + \sum_{l \in \mathcal{L}_{A}} \left( \frac{\tau_{d}}{\tau_{c}} \frac{P_{l}^{tx}}{\alpha_{AP}} + B \, \xi_{FH} \, \mathrm{SE}^{DL} \right), \tag{5}$$

where $\mathrm{SE}_{k}^{DL}$ denotes the downlink spectral efficiency of user k measured in bit/s/Hz, $\mathrm{SE}^{DL} = \sum_{k=1}^{K} \mathrm{SE}_{k}^{DL}$ is the aggregate downlink spectral efficiency, B is the system bandwidth expressed in Hz, and $\mathcal{L}_{A}$ denotes the set of active APs. The ratio $\tau_{d}/\tau_{c}$ is the fraction of each coherence block of $\tau_{c}$ samples that carries downlink data, and $P_{l}^{tx}$ is the transmit power of AP l. The coefficient $\alpha_{AP}$ represents the power amplifier efficiency at the APs, while $\xi_{UE}$ and $\xi_{FH}$ model the traffic-dependent power consumption of user terminals and fronthaul links, respectively. The fixed power term $P_{T,fix}^{DL}$ accounts for circuitry and infrastructure consumption at both active and sleeping APs and their associated fronthaul links, following the downlink fixed power model described in [5] (Section III-B).

All numerical values of the power consumption parameters used in this work are summarized in Table 2, adopting the default settings reported in [5].
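A direct transcription of (5) may help when checking units. The sketch below takes the fixed term as a given number rather than reproducing the model of [5], and the function name `total_dl_power`, the argument names, and the assumed units of the traffic coefficients are illustrative choices, not values from Table 2.

```python
import numpy as np

def total_dl_power(se_per_ue, p_tx_active, p_fix, bandwidth_hz,
                   xi_ue, xi_fh, alpha_ap, tau_d, tau_c):
    """Total downlink power in watts, term by term as in (5).

    se_per_ue    : downlink SE of every UE, in bit/s/Hz
    p_tx_active  : transmit power of each ACTIVE AP only, in watts
    p_fix        : fixed power term, treated here as a given number
    xi_ue, xi_fh : traffic-dependent coefficients, assumed in W per bit/s
    """
    se_per_ue = np.asarray(se_per_ue, dtype=float)
    p_tx_active = np.asarray(p_tx_active, dtype=float)
    se_total = se_per_ue.sum()                        # aggregate SE
    ue_term = bandwidth_hz * xi_ue * se_total         # UE traffic-dependent power
    tx_term = (tau_d / tau_c) * p_tx_active.sum() / alpha_ap
    # Each active AP contributes a fronthaul term driven by the aggregate SE.
    fh_term = p_tx_active.size * bandwidth_hz * xi_fh * se_total
    return p_fix + ue_term + tx_term + fh_term

# Round, made-up numbers chosen so the terms are easy to check by hand.
p_total = total_dl_power(se_per_ue=[2.0, 3.0], p_tx_active=[1.0, 1.0],
                         p_fix=3.0, bandwidth_hz=10.0, xi_ue=0.1, xi_fh=0.2,
                         alpha_ap=0.5, tau_d=1, tau_c=2)
```

With these inputs the four terms are 3, 5, 2, and 20 W. Note that the fronthaul term scales with both the number of active APs and the aggregate SE, so switching an AP off reduces it twice over if the SE also drops.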

The adopted power consumption parameters correspond to a representative and widely used reference configuration in the cell-free massive MIMO literature [5]. The objective of this work is not to calibrate hardware-specific power coefficients, but to evaluate the structural behavior and scalability of ASO policies under a consistent and reproducible system-level power model. Hardware-level effects such as load-dependent circuit efficiencies or amplifier nonlinearities are beyond the scope of this system-level analysis and are therefore not explicitly modeled.

While practical deployments may exhibit variations in circuitry efficiency, fronthaul implementation, or traffic-dependent power consumption, the adopted formulation captures the dominant contributors to downlink energy expenditure, namely transmit power, RF-chain power, fixed infrastructure consumption, and fronthaul traffic-dependent components. Therefore, the comparative insights obtained in this work primarily reflect the interplay between spectral efficiency saturation and infrastructure power scaling, which remains qualitatively valid across a wide range of realistic deployments.

Furthermore, the interaction between AP switch on/off decisions and dynamic power control is intentionally decoupled in this study. By adopting a fixed and scalable physical-layer baseline, the focus remains on isolating the standalone impact of AP activation on system-level energy efficiency, thereby enabling transparent comparison across scenarios of increasing network size.

3.6. Energy Efficiency Definition

The downlink EE of the considered cell-free massive MIMO system is defined as the ratio between the achievable downlink throughput and the total downlink power consumption. Since the system model is formulated in terms of downlink spectral efficiency (SE), the system-level EE is expressed as

$$\mathrm{EE} = \frac{B \, \mathrm{SE}^{DL}}{P_{T}^{DL}}, \tag{6}$$

where $\mathrm{SE}^{DL}$ denotes the aggregate downlink spectral efficiency, B is the system bandwidth, and $P_{T}^{DL}$ is the total downlink power consumption defined in (5). The resulting energy efficiency metric is measured in bit/Joule.

It is emphasized that the EE metric in (6) explicitly accounts for the power consumption of active and sleeping APs, fronthaul links, and UEs. Consequently, variations in the set of active APs directly affect both the aggregate SE and the total power consumption, making EE a suitable and sensitive performance metric for the evaluation of ASO strategies.
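A short numeric example illustrates why (6) can favor a smaller active set. The figures below are invented for illustration and are not results from the paper: switching off some APs is assumed to cost 5% of the aggregate SE while saving 20% of the total power.

```python
def energy_efficiency(bandwidth_hz, se_total, p_total_w):
    """EE in bit/Joule as in (6): throughput B * SE divided by total power."""
    if p_total_w <= 0:
        raise ValueError("total power must be positive")
    return bandwidth_hz * se_total / p_total_w

B = 20e6  # hypothetical 20 MHz bandwidth

# All APs on: 40 bit/s/Hz aggregate SE at 200 W (made-up operating point).
ee_all_on = energy_efficiency(B, se_total=40.0, p_total_w=200.0)
# Subset on: 38 bit/s/Hz at 160 W (made-up operating point).
ee_subset = energy_efficiency(B, se_total=38.0, p_total_w=160.0)

relative_gain = ee_subset / ee_all_on - 1.0
```

Under these assumed numbers EE rises from 4.0 to 4.75 Mbit/Joule, a relative gain of 18.75%, because the power saving outweighs the throughput loss.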

This definition focuses on downlink operation and system-level performance, providing a consistent and interpretable objective for the subsequent ASO problem formulation and RL-based optimization.

4. Access Point Switch On/Off Problem Formulation and Baselines

This section formally defines the ASO problem as an energy-efficiency maximization task and introduces the baseline strategies used to benchmark the proposed RL approach. After presenting the combinatorial formulation of the ASO problem, two reference schemes are described: a random activation strategy providing a practical lower bound and a greedy energy-efficiency-oriented approach serving as an upper-bound approximation.

4.1. Problem Formulation

The objective of the ASO problem is to determine the subset of APs that should remain active in order to maximize the downlink EE of the considered cell-free massive MIMO network.

Let

$$x = [x_{1}, x_{2}, \dots, x_{L}], \quad x_{l} \in \{0, 1\}, \tag{7}$$

denote the AP activation vector, where $x_{l} = 1$ indicates that AP l is active and $x_{l} = 0$ corresponds to sleep mode. The set of active APs is therefore given by $\mathcal{L}_{A} = \{l \mid x_{l} = 1\}$.

For a given AP activation configuration x, the total downlink power consumption of the network is denoted by $P_{T}^{DL}(x)$ and follows the power consumption model described in Section 3.5. Similarly, the aggregate downlink SE achieved under configuration x is denoted by $\mathrm{SE}^{DL}(x)$. The resulting downlink energy efficiency is then defined as

$$\mathrm{EE}(x) = \frac{B \, \mathrm{SE}^{DL}(x)}{P_{T}^{DL}(x)}. \tag{8}$$

The ASO problem can then be formulated as the following optimization problem:

$$x^{\star} = \arg\max_{x \in \{0, 1\}^{L}} \mathrm{EE}(x). \tag{9}$$

Problem (9) is inherently combinatorial. An exhaustive search would require evaluating $2^{L}$ possible activation patterns, which becomes computationally infeasible even for moderate values of L. This motivates the need for efficient approximation strategies capable of identifying energy-efficient AP activation patterns without exploring the entire configuration space.
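To make the combinatorial growth tangible, the sketch below solves (9) by brute force for a toy network of 10 APs. The objective `toy_ee` is a deliberately simple surrogate (a saturating rate divided by a linear power term) and not the paper's SE or power model; the gains and power values are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
L, K = 10, 4                                  # toy sizes
beta = rng.exponential(1.0, size=(K, L))      # hypothetical large-scale gains

def toy_ee(x, p_active=1.0, p_sleep=0.2):
    """Toy stand-in for EE(x) in (8): saturating rate over linear power."""
    x = np.asarray(x)
    if not x.any():
        return 0.0                            # no active AP, no throughput
    se = np.log2(1.0 + beta @ x).sum()        # surrogate aggregate SE
    power = p_active * x.sum() + p_sleep * (L - x.sum())
    return se / power

# Exhaustive search over all 2**L activation vectors (1024 for L = 10).
best = max(itertools.product((0, 1), repeat=L), key=toy_ee)
n_patterns = 2 ** L
```

The same loop is out of reach at the deployment sizes mentioned in Section 2: 32 APs give about 4.3 × 10^9 patterns and 96 APs give about 7.9 × 10^28.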

To assess the achievable performance range of practical ASO solutions, reference baselines such as lower-bound and upper-bound strategies are commonly employed. These baselines provide useful benchmarks for evaluating the effectiveness of learning-based and heuristic ASO approaches and will be described in detail in the following section.

4.2. Baseline Strategies

To evaluate the effectiveness of any ASO strategy, it is essential to establish suitable reference benchmarks that characterize the achievable performance range. In this work, two baseline strategies are considered: a random-selection ASO scheme that provides a practical lower bound on energy efficiency, and a greedy energy-efficiency-oriented ASO scheme that yields a tight upper-bound approximation. The selection of these baselines follows the methodology adopted in [5], where similar reference strategies are used to assess energy-efficient operation in cell-free massive MIMO systems. These baselines are used to contextualize the performance of the proposed PPO-based approach.

Random-Selection ASO (Lower Bound): A practical lower bound on the achievable energy efficiency can be obtained by employing a random AP activation mechanism. In this scheme, each AP is independently switched on or off with equal probability, while the number of active APs can be controlled to match a target activation level. Since this strategy does not account for channel conditions, interference, or the impact of individual APs on the aggregate throughput, the resulting energy efficiency is typically poor. Nevertheless, random-selection ASO (RS-ASO) provides a simple and computationally inexpensive baseline that serves as a conservative lower bound for the performance of more sophisticated ASO strategies.
Greedy Energy-Efficiency ASO (Upper Bound): An upper performance bound can be approximated by identifying, for a given number of active APs, the subset that maximizes energy efficiency. In principle, this would require evaluating all possible AP activation combinations, which is computationally infeasible due to the exponential growth of the search space. To obtain a tractable approximation, a greedy energy-efficiency-oriented ASO strategy is adopted. The algorithm starts with all APs active and iteratively switches off one AP at a time. At each iteration, all candidate configurations obtained by deactivating a single AP are evaluated, and the configuration yielding the highest energy efficiency is selected. This process continues until no further improvement in EE is observed. This optimal EE greedy ASO (OG-ASO) provides a tight and computationally feasible approximation of the maximum achievable energy efficiency and is used as an upper reference for performance evaluation. It should be noted that the greedy scheme is evaluated exclusively as an offline reference to approximate the location of energy-efficient operating points. Due to its combinatorial nature, it is not intended for real-time implementation, but rather serves as an interpretability and benchmarking tool.
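
The two baselines described above can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_ee` and its parameters are hypothetical placeholders rather than the system model of Section 3, `rs_aso` implements the fixed-count variant of random selection (a target number of active APs), and `og_aso` follows the backward-elimination steps given in the text.

```python
import math
import random

# Hypothetical toy parameters (not from the paper), used only to exercise the logic.
GAINS = [0.9, 0.1, 0.5, 0.05, 0.7, 0.2]
AP_POWER = [1.0, 1.0, 1.2, 0.8, 1.1, 0.9]
FIXED_POWER = 2.0


def toy_ee(x):
    """Stand-in for EE(x): saturating throughput over linear power."""
    se = math.log2(1.0 + sum(g for g, xl in zip(GAINS, x) if xl))
    return se / (FIXED_POWER + sum(p for p, xl in zip(AP_POWER, x) if xl))


def rs_aso(num_aps, num_active, rng):
    """Random selection: switch on `num_active` APs chosen uniformly at random."""
    active = set(rng.sample(range(num_aps), num_active))
    return [1 if l in active else 0 for l in range(num_aps)]


def og_aso(num_aps, ee_fn):
    """Greedy backward elimination: start all-on, drop the AP whose removal
    gives the highest EE, and stop when no removal improves EE."""
    x = [1] * num_aps
    best_ee = ee_fn(x)
    while sum(x) > 1:
        candidates = []
        for l in range(num_aps):
            if x[l]:
                trial = x.copy()
                trial[l] = 0
                candidates.append((ee_fn(trial), l))
        cand_ee, cand_l = max(candidates)
        if cand_ee <= best_ee:
            break  # no single deactivation improves EE
        x[cand_l] = 0
        best_ee = cand_ee
    return x, best_ee


rng = random.Random(0)
x_rs = rs_aso(len(GAINS), 3, rng)
x_og, ee_og = og_aso(len(GAINS), toy_ee)
print(x_rs, toy_ee(x_rs))
print(x_og, ee_og)
```

In this form the greedy search needs on the order of $L^{2}$ EE evaluations in the worst case, compared with $2^{L}$ for exhaustive search, but every evaluation still requires the full EE computation for the candidate configuration.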

The RS-ASO and OG-ASO strategies delimit the feasible performance range of ASO schemes in terms of energy efficiency, providing lower and upper reference bounds, respectively:

{EE}^{RS} \leq EE (x) \leq {EE}^{OG} .

(10)

Importantly, the objective of ASO is not simply to minimize the number of active APs or to keep all APs active, but to identify both the number and the combination of APs that jointly maximize energy efficiency. Since evaluating all possible activation configurations is computationally infeasible for realistic network sizes, learning-based approaches are particularly appealing. In the following section, a RL framework is introduced to efficiently approximate energy-efficient AP activation patterns without explicit combinatorial search.

It is important to emphasize that OG-ASO is an oracle-type benchmark: it assumes full knowledge of the instantaneous EE associated with each candidate AP subset and is evaluated offline by exhaustively searching over activation patterns for every number of active APs. In contrast, the proposed PPO policy operates online under partial observability, using only compact state information and without access to future realizations or combinatorial EE evaluations. Therefore, OG-ASO is not a practical competitor but a conceptual upper bound that reveals the structure and location of EE-optimal operating points (in terms of both EE and the corresponding number of active APs). Direct numerical comparisons between PPO and OG-ASO must be interpreted in this light and not as a fair head-to-head performance contest.

5. Reinforcement Learning Framework

The ASO problem defined in Section 4.1 is inherently combinatorial, as it requires selecting an energy-efficient activation pattern from $2^{L}$ possible configurations. While baseline strategies provide useful performance bounds, they rely on explicit evaluations of candidate configurations and are not suitable for real-time operation in large-scale networks.

To address this challenge, a RL framework is adopted in this work, in which an agent learns to infer energy-efficient AP activation decisions based on compact, partially observable network state information. The agent interacts with a simulated cell-free massive MIMO environment that captures realistic channel and power-consumption dynamics. The goal of the RL agent is to approximate the solution of the EE maximization problem without explicit combinatorial search, while maintaining low inference complexity once training is completed.

5.1. State, Action, and Reward Definition

In the proposed RL framework, the agent interacts with the environment by observing the network state, selecting an AP activation pattern, and receiving a reward that reflects the resulting energy efficiency. The definitions of the state, action, and reward are detailed as follows.

State: The state is designed to capture the structural information that determines the EE contribution of each AP, while avoiding dependence on instantaneous small-scale fading. This choice reflects the slow time scale of ASO decisions and promotes stable learning. The observation available to the agent is collected and preprocessed at the CPU, which has access to large-scale channel statistics and AP-side power parameters, as commonly assumed in cell-free massive MIMO architectures.
For each AP l, the state includes a compact set of features derived from its large-scale fading coefficients toward all UEs, namely the mean, maximum, and minimum values. In addition, the power consumption associated with activating AP l is included to account for its energy cost. The resulting local observation vector is given by

$o_{l} = [\begin{matrix} \frac{1}{K} \sum_{k = 1}^{K} {\tilde{β}}_{l, k} \\ max_{k} {\tilde{β}}_{l, k} \\ min_{k} {\tilde{β}}_{l, k} \\ P_{l} \end{matrix}],$

(11)

where ${\tilde{β}}_{l, k}$ denotes the normalized large-scale fading coefficient between AP l and UE k, and $P_{l}$ represents the power consumption of AP l in active mode. The global state is obtained by stacking the observations of all APs, resulting in a fixed-dimensional representation that scales linearly with the number of APs.
Large-scale fading coefficients are normalized using affine transformations with constants derived offline from scenario-level statistics and applied consistently across all training and evaluation episodes. Since large-scale fading evolves over much longer time scales than small-scale fading, these features and their normalization only need to be updated at the large-scale fading coherence time, resulting in negligible signaling and computational overhead. The per-AP aggregation operations (mean, max, min over users) are simple and scale linearly with the number of users.
Action: The action corresponds to selecting an AP activation pattern and is defined as a binary vector

$x = [x_{1}, x_{2}, \dots, x_{L}], \quad x_{l} \in \{0, 1\},$

(12)

where $x_{l} = 1$ indicates that AP l is active and $x_{l} = 0$ denotes that it operates in sleep mode. This multi-binary action space directly reflects the physical ASO decision and avoids explicit enumeration of the $2^{L}$ possible activation configurations.
Reward: The reward is defined directly as the downlink energy efficiency achieved under the selected AP activation pattern. Specifically, for a given action $x$ , the reward is computed as

$r = EE (x),$

(13)

where $EE (x)$ is the energy efficiency metric defined in Section 3.6. Since small-scale fading varies independently across time steps, the reward effectively reflects the energy efficiency averaged over multiple small-scale channel realizations for a fixed large-scale topology.
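
As an illustration of the state construction in (11), the sketch below builds the per-AP feature vectors and stacks them into a global state of length $4L$. The input values, the dB scale of the raw coefficients, and the normalization constants `OFFSET` and `SCALE` are hypothetical choices made for this example; the text only states that an affine normalization with offline constants is used.

```python
# Hypothetical normalization constants (the paper derives them offline
# from scenario-level statistics; these values are placeholders).
OFFSET = -100.0
SCALE = 50.0


def normalize(beta, offset=OFFSET, scale=SCALE):
    """Affine normalization of a large-scale fading coefficient."""
    return (beta - offset) / scale


def ap_observation(beta_norm_l, power_l):
    """Per-AP features of Eq. (11): mean, max, min over UEs, plus AP power."""
    mean_beta = sum(beta_norm_l) / len(beta_norm_l)
    return [mean_beta, max(beta_norm_l), min(beta_norm_l), power_l]


def global_state(beta_raw, power):
    """Stack the per-AP observations into one vector of length 4L."""
    state = []
    for beta_l, p_l in zip(beta_raw, power):
        beta_norm_l = [normalize(b) for b in beta_l]
        state.extend(ap_observation(beta_norm_l, p_l))
    return state


# Toy input: L = 3 APs, K = 2 UEs, coefficients on a dB-like scale.
beta_raw = [[-90.0, -60.0], [-75.0, -75.0], [-55.0, -95.0]]
power = [1.0, 1.0, 1.2]
state = global_state(beta_raw, power)
print(len(state), state[:4])
```

The state length depends only on $L$, not on the number of UEs $K$, since the per-AP statistics aggregate over users. In a Gymnasium environment, the action of (12) would typically be declared as `spaces.MultiBinary(L)`; whether the authors' environment does so is not stated in the text.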

5.2. PPO-Based Solution

The RL agent is trained using the PPO algorithm, a policy-gradient method widely adopted due to its robustness and stable optimization behavior under stochastic reward signals. In the considered setting, PPO is used to learn a stochastic policy for AP activation under a fixed large-scale network context, rather than to control a fully observable sequential dynamical system. This makes PPO a suitable choice for stabilizing policy updates in an episodic RL problem with partial observability and noisy performance feedback.

The policy is modeled as a stochastic function that outputs independent activation probabilities for each AP. During training, AP activation actions are sampled independently from the Bernoulli distributions parameterized by the policy outputs, which enables efficient exploration of the large combinatorial action space. During evaluation, a deterministic policy is obtained by activating AP l if its corresponding activation probability exceeds a fixed threshold of 0.5, ensuring reproducible and stable performance assessment. This formulation allows the agent to efficiently explore a large and diverse set of AP configurations and to progressively improve energy efficiency by averaging over multiple small-scale channel realizations. Although the action outputs are factorized across APs, the policy itself is conditioned on a global state representation and optimized using a system-level energy-efficiency reward. As a result, coordinated activation patterns can still emerge implicitly during learning through the shared observation and reward structure, while preserving the scalability advantages of the Bernoulli formulation.
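
The difference between the stochastic training policy and the thresholded evaluation policy can be sketched in a few lines. The logits below are hypothetical values standing in for the policy network outputs; the factorized log-probability is the standard expression for independent Bernoulli variables.

```python
import math
import random


def sigmoid(z):
    """Map a policy logit to an activation probability."""
    return 1.0 / (1.0 + math.exp(-z))


def sample_activation(probs, rng):
    """Training: sample each AP independently from Bernoulli(p_l)."""
    return [1 if rng.random() < p else 0 for p in probs]


def deterministic_activation(probs, threshold=0.5):
    """Evaluation: activate AP l if p_l exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probs]


def log_prob(action, probs):
    """Factorized policy: log pi(x|s) is the sum of per-AP Bernoulli log-likelihoods."""
    return sum(math.log(p if a else 1.0 - p) for a, p in zip(action, probs))


# Hypothetical policy logits for L = 4 APs.
logits = [2.0, -1.0, 0.3, -3.0]
probs = [sigmoid(z) for z in logits]
x_train = sample_activation(probs, random.Random(1))
x_eval = deterministic_activation(probs)
print(x_train, x_eval, log_prob(x_eval, probs))
```

Thresholding at 0.5 selects the most likely activation pattern under the factorized distribution, so the evaluation policy is the mode of the training policy.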

Both the policy and value functions are implemented using fully connected neural networks that take the state representation described in the previous subsection as input. While graph-based architectures could explicitly model the AP-UE connectivity structure of cell-free massive MIMO systems, the adopted formulation prioritizes a compact state representation and scalable policy evaluation that grows linearly with the number of APs. Although the observation remains fixed within each episode, the value function provides a low-variance baseline for policy-gradient estimation when rewards are affected by stochastic small-scale fading. Entropy regularization is applied during training to encourage exploration and avoid premature convergence to suboptimal activation patterns.

The neural network architecture and the PPO training hyperparameters used throughout the experiments are summarized in Table 3. The PPO agent is implemented using the stable-baselines3 library (v2.3.0), interacting with a custom cell-free massive MIMO environment developed using Gymnasium (v0.28.1).

The selection of PPO is motivated by several structural properties of the ASO problem. First, the action space is multi-binary and grows exponentially with the number of APs, making value-based methods that rely on explicit action enumeration less scalable. PPO directly parameterizes independent activation probabilities for each AP, avoiding combinatorial action expansion. Second, the reward signal is affected by stochastic small-scale fading realizations, resulting in noisy performance observations. Policy-gradient methods such as PPO are known to provide stable updates under stochastic rewards through clipped objective functions and entropy regularization. Finally, the objective of this work is neither to introduce a new RL algorithm nor to benchmark RL optimizers. Alternative RL formulations have been explored in the literature (see Table 1), often targeting different optimization objectives or adopting simplified power consumption models. In contrast, this work evaluates whether a well-established and robust policy-gradient method can approximate energy-efficient AP activation strategies and capture the structural scalability of standalone ASO under a realistic system-level power model within a unified and reproducible framework.
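
For reference, the clipped objective and the entropy term mentioned above take the following standard forms for a factorized Bernoulli policy. The symbols $\theta$, $\epsilon$, $\rho_{t}$, $\hat{A}_{t}$, $s_{t}$, and $p_{l}$ are notation introduced here for illustration and do not appear elsewhere in the text; the hyperparameter values actually used are those summarized in Table 3.

$L^{CLIP}(\theta) = \mathbb{E}_{t}\left[ \min\left( \rho_{t}(\theta) \hat{A}_{t}, \; \mathrm{clip}\left( \rho_{t}(\theta), 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{t} \right) \right], \quad \rho_{t}(\theta) = \frac{\pi_{\theta}(x_{t} \mid s_{t})}{\pi_{\theta_{old}}(x_{t} \mid s_{t})},$

$H\left( \pi_{\theta}(\cdot \mid s) \right) = - \sum_{l = 1}^{L} \left[ p_{l} \log p_{l} + (1 - p_{l}) \log (1 - p_{l}) \right],$

where $\hat{A}_{t}$ is the advantage estimate obtained with the value function, $\epsilon$ is the clipping range, and $p_{l}$ is the activation probability that the policy assigns to AP $l$ in state $s$. Clipping the probability ratio $\rho_{t}(\theta)$ limits the size of each policy update when individual rewards are noisy, while the entropy term, which is largest when all $p_{l}$ are close to 0.5, rewards continued exploration of activation patterns.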

5.3. Training Procedure

The RL agent is trained through repeated interactions with a simulated cell-free massive MIMO environment. A single PPO policy is trained per scenario and shared across all training episodes. Each episode corresponds to an independent network realization, in which the positions of APs and UEs are randomly generated and the associated large-scale fading coefficients are computed. These parameters remain fixed throughout the episode, reflecting the slow time scale at which ASO decisions are typically performed. This training strategy allows the agent to learn a policy that generalizes across different network topologies, rather than overfitting to a specific realization.

Within each episode, the agent performs a sequence of decision steps. At each step, a new realization of small-scale fading is generated, the agent selects an AP activation pattern based on the current state, and the resulting downlink energy efficiency is evaluated. Since the state remains unchanged within an episode, these repeated interactions allow the agent to observe stochastic variations in the reward induced by small-scale fading and to learn activation policies that perform well on average across channel realizations.

From this perspective, the learning problem can be interpreted as an episodic RL problem with a fixed context, closely related to a contextual bandit formulation, where PPO is employed as a stable policy-gradient optimizer under noisy reward observations. Policy updates are carried out using the PPO algorithm after collecting fixed-length rollouts from the environment. Standard PPO training settings are employed to ensure stable learning, as summarized in Table 3. Training is performed over multiple episodes until convergence is observed in terms of the achieved energy efficiency.
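
The episode structure described above (a topology drawn once per episode, small-scale fading redrawn at every step) can be sketched as follows. The reward model `toy_reward`, the exponential fading draw, and the placeholder `uniform_policy` are assumptions made for this example; no PPO update is performed here, and the sketch only shows how a fixed context produces noisy rewards.

```python
import math
import random


def toy_reward(action, mean_gain, rng):
    """Hypothetical noisy EE: exponential fading power scales each AP's mean gain."""
    signal = sum(g * rng.expovariate(1.0) for g, a in zip(mean_gain, action) if a)
    power = 2.0 + sum(action)  # toy fixed power plus one unit per active AP
    return math.log2(1.0 + signal) / power


def run_episode(policy, num_aps, num_steps, rng):
    """One episode: context drawn once, small-scale fading redrawn every step."""
    mean_gain = [rng.random() for _ in range(num_aps)]  # fixed context (state)
    rewards = []
    for _ in range(num_steps):
        probs = policy(mean_gain)                       # same state at every step
        action = [1 if rng.random() < p else 0 for p in probs]
        rewards.append(toy_reward(action, mean_gain, rng))
    return mean_gain, rewards


def uniform_policy(state):
    """Placeholder policy: every AP is activated with probability 0.5."""
    return [0.5] * len(state)


rng = random.Random(0)
state, rewards = run_episode(uniform_policy, num_aps=8, num_steps=200, rng=rng)
print(len(rewards), sum(rewards) / len(rewards))
```

Because the state never changes within an episode, the average of the per-step rewards estimates the expected EE of the policy for that topology, which is the quantity a contextual-bandit learner would optimize.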

The total number of training interactions is adapted to the size of the considered scenario. As the dimensionality of the ASO problem increases with the number of APs, a larger training budget is required to ensure stable convergence of the learning process. Accordingly, the RL agent is trained for 100,000, 200,000, and 300,000 timesteps in Scenario 1, Scenario 2, and Scenario 3, respectively. The detailed characteristics of these scenarios are defined later in Section 6.1.

From an architectural perspective, the ASO decision process is assumed to be executed at the CPU, which already collects large-scale fading statistics and network-level information in typical cell-free massive MIMO deployments. The PPO policy is trained offline using simulated environments and subsequently deployed at the CPU for online inference. Since the state representation relies exclusively on large-scale fading coefficients and per-AP power parameters—quantities that evolve over relatively slow time scales—the computational burden associated with policy inference is minimal and compatible with centralized implementation. The resulting AP activation vector is then communicated to the distributed APs through existing fronthaul signaling mechanisms. This centralized decision-making assumption aligns with scalable cell-free architectures where high-level coordination and resource management are performed at the CPU, while low-level signal processing tasks remain distributed at the APs.

6. Methodology

This section describes the methodological framework adopted to evaluate the proposed RL-based ASO strategy. It includes the simulation environment and deployment assumptions, the evaluation scenarios considered in the numerical analysis, and the principles guiding the comparison between different ASO approaches. In addition, the performance metrics used to assess EE, robustness, and scalability are defined. The objective of this section is to ensure a fair, transparent, and reproducible evaluation of ASO strategies operating under different information and implementation constraints.

6.1. Simulation Setup

The simulation environment and parameter settings adopted in this work are selected to ensure realism, reproducibility, and fair scalability across different network sizes. The considered scenarios and system parameters are primarily based on the reference configurations reported in [25] (Table II), which have been widely adopted in recent studies on cell-free massive MIMO systems. In addition, the deployment and propagation assumptions are aligned with the recommendations provided by the Radiocommunication Sector of the International Telecommunication Union (ITU-R) for UMa environments [26].

All numerical results are obtained using a custom simulator implemented in Python (v3.10.11; Python Software Foundation, Wilmington, DE, USA). The simulator is adapted from the MATLAB (R2023a, 9.14.0.2337262, Update 5; MathWorks, Natick, MA, USA) code originally developed by Emil Björnson to reproduce the results reported in [1]. The Python implementation faithfully reproduces the complete downlink processing chain, including channel generation, channel estimation, precoding, SE computation, power consumption evaluation, and EE calculation, while enabling seamless integration with RL frameworks.

Each simulation run corresponds to an independent realization of the network topology. APs and UEs are placed uniformly at random over a square area. To avoid border effects and ensure statistically homogeneous spatial conditions, a wrapped-around topology is adopted, such that each UE experiences large-scale fading and interference as if surrounded by periodic replicas of the deployment.
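
A wrapped-around topology is commonly implemented by measuring distances on a torus, which for points inside the square is equivalent to taking the minimum distance over the shifted replicas of the deployment. The sketch below shows this computation; the side length of 1000 m is a hypothetical value chosen for the example, and the function is not taken from the authors' simulator.

```python
import math


def wrapped_distance(p, q, side):
    """Shortest distance between p and q on a square of length `side`
    whose opposite edges are identified (torus topology).
    Both points are assumed to lie inside [0, side] x [0, side]."""
    dx = abs(p[0] - q[0])
    dy = abs(p[1] - q[1])
    dx = min(dx, side - dx)
    dy = min(dy, side - dy)
    return math.hypot(dx, dy)


# An AP near the left edge and a UE near the right edge are close once wrapped.
print(wrapped_distance((50.0, 500.0), (950.0, 500.0), side=1000.0))  # 100.0
```

With this metric no UE is ever farther than half the side length from the nearest replica of an AP along each axis, so UEs near the border see the same statistical environment as UEs at the center.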

To assess the scalability of ASO strategies, three evaluation scenarios with increasing network size and coverage area are considered. The deployment area is scaled with the number of APs in order to preserve approximately constant AP density across scenarios, thereby avoiding unrealistically dense deployments in large-scale configurations and artificially favorable interference conditions in small areas. The considered scenarios, denoted as Scenario 1, Scenario 2, and Scenario 3, are summarized in Table 4.

Unless otherwise stated, numerical results are obtained by averaging over multiple independent realizations of the network topology to ensure statistically reliable performance estimates. The general simulation and propagation parameters common to all scenarios are summarized in Table 5.

The antenna-to-user ratio and pilot configuration are intentionally kept consistent across all scenarios in order to isolate the scalability of ASO policies from variations in physical-layer dimensionality. By fixing these parameters, the analysis focuses on how AP activation strategies behave as the network size increases, while preserving comparable density and propagation conditions. Although varying the $N / K$ ratio or pilot allocation strategy could provide additional sensitivity insights, the present study prioritizes controlled baseline configurations to highlight structural energy-efficiency trends under scalable deployments.

The considered evaluation scenarios are designed to capture structurally different deployment conditions while preserving comparable AP density across network sizes. By jointly scaling the coverage area and the number of APs and UEs, the analysis isolates the effect of network scale from artificial densification effects. Although the simulations focus on a UMa NLoS propagation model, the qualitative trade-offs observed in this study (namely the saturation of spectral efficiency with increasing infrastructure density and the near-linear scaling of fixed and circuit-related power consumption) are structural properties of distributed multi-point architectures. Therefore, the insights regarding the existence of intermediate energy-efficient operating regions and the scalability of learned ASO policies are expected to extend to other deployment conditions with similar large-scale fading statistics and infrastructure power scaling behavior.

6.2. Comparison Framework and Evaluation Philosophy

The evaluation of ASO strategies must account for the information available to each method and the practical constraints under which decisions are taken. In this work, the proposed RL-based approach is compared against two reference frameworks that represent implementable and widely used baselines.

The first reference framework corresponds to the all-on configuration, in which all APs remain active at all times. This scheme provides a simple and deterministic baseline that maximizes spatial diversity but incurs the highest power consumption. It serves as a lower reference in terms of energy efficiency and allows quantifying the potential gains achievable through AP deactivation.

The second reference framework is a random ASO strategy. In this case, AP activation decisions are generated randomly, without exploiting any information about the network state or the learned policy. Specifically, at each decision step, a random binary activation vector is drawn and evaluated using the same simulation and performance metrics as the RL-based approach. Importantly, this random strategy does not rely on the trained RL model and is used solely to characterize the performance achievable by unstructured and uninformed activation decisions.

These two baselines represent practical and implementable reference schemes operating under the same information constraints as the RL-based approach. In contrast, the greedy ASO strategy, which exploits full large-scale fading knowledge to optimize energy efficiency on a per-realization basis, is not considered a competing method but is instead used exclusively as an upper bound to contextualize the performance of practical ASO policies.

In addition to the baselines considered in this work, one could envision simple non-learning heuristics that exploit partial large-scale channel information similar to that used by the PPO policy. For example, APs could be ranked according to basic metrics derived from large-scale fading statistics (e.g., mean or maximum ${\tilde{β}}_{l, k}$, possibly normalized by the AP power consumption $P_{l}$), and a fixed number of top-ranked APs could be activated. Such heuristics provide useful intuition but require manual design choices, such as the selection of ranking criteria and the number of active APs, and do not naturally adapt across scenarios or operating points.
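
One such ranking heuristic could look like the sketch below, which scores each AP by its mean normalized large-scale fading per unit of power and activates the top-ranked ones. The scoring rule, the function name `top_m_heuristic`, and the input values are hypothetical; this heuristic is not evaluated in this work, and both the criterion and `num_active` would have to be chosen manually.

```python
def top_m_heuristic(beta_norm, power, num_active):
    """Activate the `num_active` APs with the largest mean normalized
    large-scale fading per unit of power (one possible ranking criterion)."""
    scores = [sum(b) / len(b) / p for b, p in zip(beta_norm, power)]
    ranked = sorted(range(len(scores)), key=lambda l: scores[l], reverse=True)
    active = set(ranked[:num_active])
    return [1 if l in active else 0 for l in range(len(scores))]


# Hypothetical toy values: L = 4 APs, K = 2 UEs.
beta_norm = [[0.2, 0.8], [0.1, 0.1], [0.9, 0.7], [0.4, 0.2]]
power = [1.0, 1.0, 2.0, 0.5]
print(top_m_heuristic(beta_norm, power, num_active=2))  # [1, 0, 0, 1]
```

In this toy instance the AP with the strongest channels (index 2) is not selected because its power cost is twice that of the others, which illustrates how strongly the outcome depends on the hand-picked criterion.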

In this work, we deliberately focus on a minimal set of baselines that span from trivial implementable strategies (random and all-on) to an oracle upper bound (greedy ASO), in order to highlight distributional energy-efficiency behavior and scalability, rather than optimizing or tuning scenario-specific heuristics.

7. Results and Discussion

This section presents and discusses the numerical results obtained with the proposed ASO strategies. The analysis is structured to progressively build intuition about the problem before assessing the performance of the learning-based approach.

First, the behavior of baseline ASO schemes is examined to characterize the intrinsic trade-offs between throughput, power consumption, and EE as a function of the number of active APs. This preliminary analysis provides insight into the structure of the EE landscape and serves as a reference for interpreting the results obtained with RL. Subsequently, the performance of the proposed PPO-based ASO strategy is evaluated and compared against the considered baselines under identical simulation conditions.

7.1. Baseline Analysis

We first examine the behavior of baseline ASO strategies to characterize the intrinsic trade-offs between energy efficiency, spectral efficiency, and power consumption before introducing learning-based solutions. This analysis establishes reference operating regions that facilitate the interpretation of the subsequent RL results.

For clarity, the baseline behavior is illustrated using Scenario 1, which serves as a representative deployment. Similar qualitative trends are observed in larger-scale scenarios. Figure 1 shows the EE, total power consumption, and sum SE as a function of the number of active APs for the greedy upper bound and the random ASO lower bound.

The greedy ASO strategy achieves its maximum energy efficiency with a small number of active APs, peaking around $M \approx 4$–5. This sharp EE peak arises from the rapid saturation of downlink spectral efficiency in small-scale deployments, where a limited number of well-positioned APs already provide most of the available spatial diversity and array gain. Beyond this point, additional APs contribute only marginal throughput improvements while introducing nearly linear increases in fixed and hardware-related power consumption. As a result, the EE metric strongly favors minimal active infrastructure in this scenario. In contrast, the random ASO strategy exhibits consistently low energy efficiency across the entire range of active APs. Although the sum SE increases approximately linearly with the number of active APs $M$, the lack of structure in the AP selection process prevents the effective exploitation of favorable propagation conditions.

It is also observed that the total power consumption curves of both strategies are nearly identical for a given number of active APs, indicating that differences in EE are primarily driven by the quality of the selected AP subset rather than by variations in consumed power.

Overall, this baseline analysis highlights a well-defined trade-off between spectral efficiency and power consumption in cell-free massive MIMO systems. While the greedy strategy serves as an informative upper bound, it relies on full large-scale fading knowledge and is not implementable in practical online systems. These observations motivate the use of learning-based approaches to identify robust and efficient operating regions under realistic information constraints.

7.2. Training Behavior

Figure 2 illustrates the evolution of the training reward as a function of the number of environment interaction steps for the three considered scenarios. In all cases, the PPO agent exhibits a stable learning behavior, with a clearly increasing trend in the smoothed reward and no signs of divergence or premature convergence. This indicates that the chosen reward definition and hyperparameter configuration allow the agent to progressively improve its energy-efficiency-oriented decisions despite the stochastic nature of the wireless channel.

For Scenarios 1 and 2, the reward evolution shows a consistent improvement throughout the training process, eventually stabilizing around a steady operating region. In the more demanding Scenario 3, where the baseline power consumption is significantly higher due to the increased number of APs and UEs, the reward values are lower in absolute terms, yet the learning dynamics remain stable. This confirms that the PPO agent is able to adapt its policy to increasingly complex and energy-costly deployments without collapsing to trivial activation patterns.

The energy efficiency achieved during training as a function of the number of active APs is depicted in Figure 3. For all scenarios, a clear bell-shaped relationship emerges, highlighting the fundamental trade-off underlying the ASO problem. Activating too few APs leads to insufficient spectral efficiency and poor coverage, whereas activating too many APs results in excessive power consumption dominated by fixed and hardware-related components. Consequently, the highest EE values are attained in an intermediate operating region.

A key observation is that the location of this energy-efficient operating region shifts systematically with the network scale. In Scenario 1, the highest EE values concentrate around $M \approx 15$ active APs, while in Scenarios 2 and 3 the region of highest energy efficiency moves towards $M \approx 30$ and $M \approx 45$ active APs, respectively. Despite this shift, the PPO-based policy tends to operate at moderate activation levels, corresponding to roughly half of the available APs in the considered scenarios. This behavior emerges naturally from the energy efficiency objective and reflects a scalable and self-adaptive activation strategy, rather than reliance on explicit constraints on the number of active APs.

The dispersion of EE values observed for a fixed number of active APs is a natural consequence of the underlying channel variability, as each training step corresponds to a different small-scale fading realization. Importantly, the PPO agent does not attempt to overfit instantaneous channel conditions, but instead learns a robust activation policy that captures the structural trade-off between spectral efficiency and power consumption. Overall, these results confirm that the proposed PPO-based ASO framework successfully internalizes the energy-efficiency characteristics of cell-free massive MIMO systems and scales coherently as the network size increases.

The training cost of the PPO-based policy is incurred offline and only once per scenario. In practice, the overall wall-clock training time remains well within practical limits, on the order of minutes to a few hours on a standard CPU-based implementation, depending on the selected training budget. For reference, convergence was achieved within approximately $10^{3}$–$10^{4}$ s across the considered scenarios. The observed training time is influenced not only by the number of APs and training steps, but also by the computational cost of environment evaluation, rollout collection, and system-level factors, and therefore does not scale monotonically with the network size. Importantly, once training is completed, policy inference consists of a single forward pass through a lightweight neural network and incurs negligible computational complexity. While the all-on baseline requires no training and remains attractive when no learning infrastructure is available, it is consistently dominated in terms of energy efficiency whenever offline training is feasible.

7.3. Performance Evaluation

This subsection evaluates the performance of the proposed PPO-based AP activation policy and compares it against two baseline strategies: random AP activation and the all-on configuration, where all APs are simultaneously active. The evaluation focuses on EE as the primary metric, complemented by distribution-based indicators to assess robustness and reliability across different network realizations. Results are presented for the three evaluation scenarios of increasing network size defined in Section 6.1. First, a global comparison is provided through EE boxplots to highlight typical performance and variability. Then, CDFs are analyzed to capture the behavior of the lower and upper tails of the EE distribution. Finally, the relationship between EE and the number of active APs is examined to elucidate the underlying mechanisms driving the performance gains of the learned policy. Unless otherwise stated, all numerical results, including boxplots and CDFs, are computed over 300 independent network realizations, each corresponding to a different random topology. Increasing the number of realizations was observed to reduce variance but did not alter the qualitative trends reported in this section.

Figure 4 depicts the distribution of the achieved energy efficiency for the different AP activation strategies across the considered evaluation scenarios. A consistent trend can be observed in all cases: the proposed PPO-based policy achieves the highest median EE, outperforming both the random activation strategy and the all-on baseline under typical operating conditions. Specifically, PPO attains median EE values of 1.62 Mbit/J, 1.76 Mbit/J, and 0.42 Mbit/J in Scenarios 1, 2, and 3, respectively. These values correspond to improvements of 66.0%, 27.5%, and 39.4% over random AP activation, and 49.6%, 14.0%, and 25.4% over the all-on configuration. The largest relative gains are observed in Scenario 1, where dense deployments amplify the benefits of structured AP selection. It is worth noting that the absolute energy efficiency values observed in Scenario 3 are lower than in the smaller scenarios. This behavior is expected and mainly driven by the substantially higher total power consumption associated with larger-scale deployments, including a higher number of active APs, fronthaul links, and served users. Importantly, this reduction in absolute EE does not indicate a loss of effectiveness of the proposed method, but rather reflects a different operating regime in which the relative performance gains of intelligent AP activation remain clearly visible.
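
As a quick consistency check on the figures quoted above, the baseline medians implied by the reported PPO medians and percentage gains can be recovered by inverting the relative-gain formula. This sketch assumes that each percentage is computed as $100 \, (EE_{PPO} - EE_{base}) / EE_{base}$ on the medians; the resulting values are approximations affected by rounding of the reported numbers and are not figures stated in the text.

```python
# Reported PPO medians (Mbit/J) and relative gains (%) from the text above.
ppo_median = {1: 1.62, 2: 1.76, 3: 0.42}
gain_vs_random = {1: 66.0, 2: 27.5, 3: 39.4}
gain_vs_all_on = {1: 49.6, 2: 14.0, 3: 25.4}


def implied_baseline(ppo, gain_percent):
    """Invert gain = 100 * (ppo - base) / base to recover the baseline median."""
    return ppo / (1.0 + gain_percent / 100.0)


implied = {}
for s in (1, 2, 3):
    rnd = implied_baseline(ppo_median[s], gain_vs_random[s])
    on = implied_baseline(ppo_median[s], gain_vs_all_on[s])
    implied[s] = (rnd, on)
    print(f"Scenario {s}: random ~ {rnd:.2f}, all-on ~ {on:.2f} Mbit/J")
```

Under this assumption the implied random-activation median is below the implied all-on median in all three scenarios, since the reported gain over random activation is always the larger of the two.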

A particularly relevant observation is that the first quartile of the PPO-based policy consistently exceeds the median EE attained by the random activation baseline. This indicates that the proposed method outperforms random AP selection not only on average, but in the vast majority of realizations. While the random strategy may occasionally achieve EE values comparable to those of PPO in the extreme upper tail, the bulk of its probability mass remains concentrated at substantially lower EE levels. In contrast, the median performance of the random strategy never surpasses that of the all-on configuration, confirming that naive AP deactivation alone is insufficient to improve energy efficiency.
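The quartile comparison described above can be reproduced from per-realization EE samples with a few lines of NumPy. This is a minimal sketch: the helper names are hypothetical, and the sample arrays are synthetic placeholders rather than results from the paper.

```python
import numpy as np


def q1_exceeds_median(ee_policy, ee_baseline):
    """Return True if the first quartile of the policy EE exceeds the baseline median."""
    q1_policy = np.percentile(ee_policy, 25)
    median_baseline = np.median(ee_baseline)
    return bool(q1_policy > median_baseline)


def fraction_above_baseline_median(ee_policy, ee_baseline):
    """Fraction of policy realizations whose EE exceeds the baseline median."""
    ee_policy = np.asarray(ee_policy, dtype=float)
    return float(np.mean(ee_policy > np.median(ee_baseline)))


# Synthetic placeholder samples (NOT the paper's data), 300 realizations each.
rng = np.random.default_rng(0)
ee_ppo = rng.normal(1.6, 0.2, size=300)
ee_random = rng.normal(1.0, 0.3, size=300)

print(q1_exceeds_median(ee_ppo, ee_random))
print(fraction_above_baseline_median(ee_ppo, ee_random))
```

When the first check holds, at least 75% of the policy's realizations lie above the baseline median, which is the sense in which the policy wins "in the vast majority of realizations".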

Overall, these results demonstrate that the gains achieved by the PPO-based approach stem from intelligent AP selection rather than from a mere reduction in the number of active APs. The learned policy consistently surpasses both baselines in terms of typical energy efficiency and robustness, thereby validating the effectiveness of the proposed RL framework.

Figure 5 illustrates the cumulative distribution functions (CDFs) of the energy efficiency for the different AP activation strategies under the three evaluation scenarios. This representation enables a detailed comparison of the reliability and statistical behavior of each method as the network size and density increase.
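For reference, an empirical CDF of this kind can be computed directly from the per-realization EE values. The following sketch uses hypothetical helper names and is not taken from the authors' code.

```python
import numpy as np


def empirical_cdf(samples):
    """Return sorted sample values x and the empirical CDF p evaluated at each x."""
    x = np.sort(np.asarray(samples, dtype=float))
    p = np.arange(1, x.size + 1) / x.size
    return x, p


def cdf_at(samples, threshold):
    """Empirical probability that a sample is less than or equal to threshold."""
    return float(np.mean(np.asarray(samples, dtype=float) <= threshold))


# Toy EE values in Mbit/J (illustrative only).
x, p = empirical_cdf([3.0, 1.0, 2.0, 4.0])
print(x, p)
print(cdf_at([3.0, 1.0, 2.0, 4.0], 2.0))
```

A curve that lies further to the right at a given probability level corresponds to a strategy with higher EE at that quantile, which is how the lower and upper tails of the strategies are compared.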

Across all scenarios, the PPO-based policy consistently achieves higher energy efficiency than the random activation baseline over a wide range of operating points. Although the gap between the two methods gradually narrows as the scenario complexity increases, the PPO-based approach maintains a systematic advantage in terms of average performance and distributional shift, confirming that the learned policy scales favorably to larger deployments.

While random AP activation may occasionally attain EE values in the extreme upper tail that are comparable to or even exceed those achieved by the PPO-based policy—reflecting sporadic fortunate AP selections—the bulk of the probability mass remains concentrated at substantially lower EE levels. This behavior is evidenced by the fact that the first quartile of the PPO distribution consistently exceeds the median EE achieved by random activation, confirming that structured policies are essential for reliable energy-efficient operation.

When compared with the all-on configuration, a complementary trend emerges. While the all-on strategy can offer competitive performance in the lower tail of the distribution due to its high spectral efficiency, its overall EE is fundamentally constrained by excessive power consumption. As a result, the PPO-based policy achieves a more balanced distribution, favoring energy-efficient operating points without incurring the substantial power overhead associated with activating all APs.

Taken together, the CDF results indicate that the PPO-based ASO policy does not aim at maximizing extreme or isolated outcomes, but rather at improving the likelihood of operating in energy-efficient regimes across a wide range of network realizations. This probabilistic advantage becomes particularly relevant as the network scales, and it motivates a closer inspection of how the learned policy balances energy efficiency and the number of active APs, which is examined next.

Figure 6 depicts the energy efficiency achieved by the PPO-based policy and the random activation baseline conditioned on the same number of active APs, for the three considered evaluation scenarios. This representation isolates the effect of AP selection from the purely quantitative impact of the number of active APs.
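Conditioning on the number of active APs amounts to grouping the realizations by M and summarizing the EE within each group. A minimal sketch follows, assuming each realization is logged as a pair (number of active APs, EE); the function name `ee_by_active_aps` and the toy values are hypothetical.

```python
import numpy as np


def ee_by_active_aps(num_active, ee):
    """Group EE samples by number of active APs; return {M: (mean, variance)}."""
    num_active = np.asarray(num_active)
    ee = np.asarray(ee, dtype=float)
    stats = {}
    for m in np.unique(num_active):
        values = ee[num_active == m]
        # Population variance (ddof=0) of the EE among realizations with M = m.
        stats[int(m)] = (float(values.mean()), float(values.var()))
    return stats


# Toy log of four realizations (illustrative values, not the paper's data).
stats = ee_by_active_aps([12, 12, 13, 13], [1.0, 2.0, 3.0, 3.0])
print(stats)
```

Comparing two strategies only at values of M that appear in both logs removes the effect of how many APs are active and leaves the effect of which APs are active.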

In Scenario 1 (Figure 6a), both methods operate within an intermediate range of active APs ($M \approx 12$–20). For all observed values of M, the PPO-based policy consistently achieves higher EE values than the random activation strategy. This behavior indicates that, even in moderately sized deployments, informed AP selection already plays a decisive role, and that the gains provided by PPO are not driven by a reduction in the number of active APs, but by the activation of more energy-efficient AP subsets.

As the network size increases in Scenario 2 (Figure 6b), the operating range shifts toward larger values of M (≈28–38). In this case, the performance gap between PPO and Random ASO remains clearly visible across nearly the entire range of M, although with increased dispersion due to the higher variability of the environment. Importantly, the separation between both curves does not collapse as the number of APs increases, suggesting that the PPO-based policy learns an activation strategy that scales favorably with the system size and preserves a consistent advantage even as the decision space grows significantly.

In Scenario 3 (Figure 6c), corresponding to the largest and densest deployment, the range of active APs further expands ($M \approx 42$–56), and the variability of the EE naturally increases for both methods. Although the absolute EE gap between PPO and Random ASO becomes smaller in this scenario, the PPO-based policy continues to exhibit higher average EE values and a more favorable upper envelope for most values of M. Occasional realizations in which the random strategy attains comparable or slightly higher EE values are not problematic, as they stem from the stochastic nature of the baseline and do not affect the average behavior or the overall performance distribution.

Although the greedy ASO strategy introduced in Section 4.2 provides useful insight into the maximum achievable EE under full information, its operating point differs markedly from that of the PPO-based policy observed in Figure 6. In particular, Figure 1 shows that the greedy upper bound attains its maximum EE at very small activation levels (approximately $M \approx 4$–5 in Scenario 1), whereas the PPO-based policy typically operates at moderate activation levels, activating on the order of 12–20 APs and yielding median EE values around 1.6 Mbit/J. This apparent mismatch does not indicate a failure of the learning-based approach to identify energy-efficient configurations. Rather, it reflects the fundamentally different information assumptions and optimization settings: the greedy strategy exploits full large-scale fading knowledge and explicit combinatorial evaluation at decision time, while PPO learns a stochastic policy under partial observability and is optimized to perform robustly across a wide range of channel realizations. Consequently, the learned policy deliberately sacrifices peak EE achievable in isolated realizations in order to avoid aggressive AP deactivation that could lead to poor EE or QoS degradation under unfavorable conditions, resulting in stable operation around moderate activation levels.

8. Conclusions and Future Work

8.1. Conclusions

This paper investigated the problem of ASO in cell-free massive MIMO systems from an energy-efficiency perspective, proposing a reinforcement learning-based solution grounded on PPO. Unlike heuristic or oracle-based approaches, the proposed method learns a state-dependent AP activation policy under partial observability, aiming at maximizing long-term energy efficiency while remaining implementable in practical online systems.

Comprehensive numerical evaluations were conducted across multiple deployment scenarios of increasing scale and complexity. The results consistently show that the PPO-based policy significantly outperforms random AP activation, achieving substantial gains in average energy efficiency and markedly improved performance in the lower tail of the distribution. Importantly, these gains are obtained without reducing the average number of active APs, demonstrating that the proposed approach does not rely on trivial infrastructure savings, but rather on intelligent AP selection.

Specifically, the proposed PPO-based approach achieves median energy efficiency improvements of 66.0%, 27.5%, and 39.4% over random AP activation in Scenarios 1, 2, and 3, respectively. These gains are consistently observed across increasing network sizes, confirming that the learned policy scales effectively while maintaining robust energy-efficient operation.

When compared to the all-on baseline, which maximizes spectral efficiency at the expense of excessive power consumption, the PPO-based strategy achieves higher energy efficiency with a considerably lower power budget, especially in medium and large-scale deployments. This highlights the ability of the learned policy to exploit the saturation of spectral efficiency while avoiding unnecessary energy expenditure.

Furthermore, the analysis conditioned on the number of active APs revealed that, for a given number of active APs, the PPO-based policy systematically achieves higher energy efficiency than random activation. This observation confirms that the performance improvements stem from selecting more energy-efficient AP subsets, rather than from activating fewer APs. The persistence of these gains across all considered scenarios demonstrates that the proposed approach scales favorably with the network size and remains effective even as the decision space grows substantially.

Overall, the results demonstrate that reinforcement learning can effectively capture the structural energy-efficiency trade-offs of cell-free massive MIMO systems and learn scalable AP activation strategies under partial observability. These findings highlight the potential of learning-based resource management approaches to support energy-efficient operation in dense distributed wireless infrastructures, which are expected to play a central role in future 6G communication and sensing-enabled networks.

8.2. Future Work

Future research may extend the proposed framework in several directions. First, richer observation spaces could be explored to assess whether incorporating additional large-scale fading statistics, traffic-aware indicators, or limited neighborhood information can further improve energy efficiency while preserving scalability. Second, alternative reward formulations may be investigated to balance energy efficiency with spectral efficiency or quality-of-service constraints, potentially enhancing robustness under heavily loaded conditions.

Another direction for future research consists of evaluating the proposed ASO framework under a broader set of deployment conditions, including alternative propagation environments (e.g., rural or indoor scenarios), heterogeneous AP hardware configurations, and varying traffic profiles. Such extensions would enable a more detailed sensitivity analysis of the energy-efficiency-scalability trade-off under diverse infrastructure assumptions and further assess the robustness of learning-based ASO policies beyond the standardized UMa configuration considered in this work.

Extending the framework to dynamic environments with user mobility and time-varying traffic represents another promising direction. Such settings would naturally introduce temporal correlations across decision epochs and could benefit from memory-based policy architectures, such as recurrent or LSTM-enhanced PPO formulations. Incorporating joint uplink-downlink optimization, as well as exploring full-duplex operation, would further enrich the ASO design space and enable coordinated resource allocation across communication directions. These extensions would allow modeling episodic large-scale evolution and dynamic decision-making under realistic network dynamics.

The proposed energy-efficient AP activation approach also provides a foundation for integrated sensing and communication (ISAC) scenarios in 6G networks, where scalable resource allocation can simultaneously support radar sensing coverage and multi-user downlink communication while maintaining sustainability. In addition, evaluating cross-scenario generalization and transfer learning capabilities would provide deeper insight into the robustness of learning-based ASO strategies across heterogeneous deployments. Finally, future work may address practical implementation aspects, including distributed control architectures and real-time inference analysis in large-scale networks.

Author Contributions

Conceptualization, G.G.-B., A.A. and M.F.; methodology, G.G.-B.; software, G.G.-B. and A.A.; validation, G.G.-B., A.A. and M.F.; formal analysis, G.G.-B.; investigation, G.G.-B. and A.A.; resources, G.G.-B. and A.A.; data curation, G.G.-B. and A.A.; writing—original draft preparation, G.G.-B.; writing—review and editing, G.G.-B. and M.F.; visualization, G.G.-B.; supervision, M.F.; project administration, M.F.; funding acquisition, M.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by a grant from the Spanish Ministry of Economic Affairs and Digital Transformation and the European Union–NextGenerationEU [UNICO I+D 6G/INSIGNIA] (TSI-064200-2022-006).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are generated through numerical simulations. The source code used to generate the results is publicly available at https://github.com/Fivecomm/cell-free-aso-ppo.git (accessed on 14 February 2026). No additional datasets were generated or analyzed during the current study.

Conflicts of Interest

Authors Guillermo García-Barrios and Manuel Fuentes were employed by the company 5G Communications for Future Industry Verticals S.L. (Fivecomm). Alberto Alonso was employed by IBM, Spain, and was also an employee of Fivecomm during part of the work. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Demir, Ö.T.; Björnson, E.; Sanguinetti, L. Foundations of User-Centric Cell-Free Massive MIMO. Found. Trends Signal Process. 2021, 14, 162–472.
2. Chen, S.; Zhang, J.; Zhang, J.; Björnson, E.; Ai, B. A survey on user-centric cell-free massive MIMO systems. Digit. Commun. Netw. 2022, 8, 695–719.
3. Elhoushy, S.; Ibrahim, M.; Hamouda, W. Cell-Free Massive MIMO: A Survey. IEEE Commun. Surv. Tutor. 2022, 24, 492–523.
4. Feng, M.; Mao, S.; Jiang, T. Base Station ON-OFF Switching in 5G Wireless Networks: Approaches and Challenges. IEEE Wirel. Commun. 2017, 24, 46–54.
5. Femenias, G.; Lassoued, N.; Riera-Palou, F. Access Point Switch ON/OFF Strategies for Green Cell-Free Massive MIMO Networking. IEEE Access 2020, 8, 21788–21803.
6. Vu, T.X.; Chatzinotas, S.; ShahbazPanahi, S.; Ottersten, B. Joint Power Allocation and Access Point Selection for Cell-free Massive MIMO. In Proceedings of the ICC 2020 - 2020 IEEE International Conference on Communications (ICC), Dublin, Ireland, 7–11 June 2020; pp. 1–6.
7. Jung, S.; Hong, S.E. Performance analysis of Access Point Switch ON/OFF schemes for Cell-free mmWave massive MIMO UDN systems. In Proceedings of the 2021 International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Republic of Korea, 20–22 October 2021; pp. 644–647.
8. Ito, M.; Kanno, I.; Amano, Y.; Kishi, Y.; Chen, W.Y.; Choi, T.; Molisch, A.F. Joint AP On/Off and User-Centric Clustering for Energy-Efficient Cell-Free Massive MIMO Systems. In Proceedings of the 2022 IEEE 96th Vehicular Technology Conference (VTC2022-Fall), London, UK, 26–29 September 2022; pp. 1–5.
9. Hong, S.E.; Na, J.H. Joint Access Point Beamforming and Switch On/Off Scheme for Energy Efficient Cell-Free mmWave massive MIMO. In Proceedings of the 2022 13th International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Republic of Korea, 19–21 October 2022; pp. 730–732.
10. Zhou, O.; Wang, J.; Liu, F.; Wang, J. Energy-Efficient Clustered Cell-Free Networking With Access Point Selection. IEEE Open J. Commun. Soc. 2024, 5, 1551–1565.
11. Munawar, M.; Guenach, M.; Moerman, I. Performance and Architectural Tradeoffs in Scalable Cell-Free Massive MIMO. IEEE Access 2024, 12, 150189–150203.
12. Mendoza, C.F.; Schwarz, S.; Rupp, M. Deep Reinforcement Learning for Dynamic Access Point Activation in Cell-Free MIMO Networks. In Proceedings of the WSA 2021; 25th International ITG Workshop on Smart Antennas, Sophia Antipolis, France, 10–12 November 2021; pp. 1–6.
13. Suh, H.; Oh, J.; Kang, S.; Hwang, T. DRL-Based AP Switch On/Off Scheme for Cell-Free Massive MIMO MEC Networks. In Proceedings of the 2023 International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Republic of Korea, 11–13 October 2023; pp. 235–237.
14. Sun, L.; Hou, J.; Chapman, R. Multi-Agent Deep Reinforcement Learning for Access Point Activation Strategy in Cell-Free Massive MIMO Networks. In Proceedings of the IEEE INFOCOM 2023 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Hoboken, NJ, USA, 20 May 2023; pp. 1–6.
15. Li, W.; Jiang, Y.; Huang, Y.; Zheng, F.C. Energy-Efficient Access Point Sleep Control in User-Centric Cell-Free Massive MIMO Systems. In Proceedings of the 2024 16th International Conference on Wireless Communications and Signal Processing (WCSP), Hefei, China, 24–26 October 2024; pp. 585–590.
16. Xu, X.; Jiang, Y.; Huang, Y.; Zheng, F.C. A Nested DRL-Based Method for Power Allocation and AP Sleep Control in Cell-Free Massive MIMO Systems. In Proceedings of the 2024 10th International Conference on Computer and Communications (ICCC), Chengdu, China, 13–16 December 2024; pp. 1663–1668.
17. Xu, J.; Wang, C.; Deng, D.; Li, Y.; Pang, M.; Zhang, Z.; Wang, D. Joint AP Scheduling and Power Allocation Based on Synergistic DRL for Cell-Free Massive MIMO. IEEE Commun. Lett. 2025, 29, 1082–1086.
18. Tan, F.; Deng, Q.; Liu, Q. Energy-efficient access point clustering and power allocation in cell-free massive MIMO networks: A hierarchical deep reinforcement learning approach. EURASIP J. Adv. Signal Process. 2024, 2024, 18.
19. Wu, Z.; Jiang, Y.; Huang, Y.; Zheng, F.C.; Zhu, P. Energy-Efficient Joint AP Selection and Power Control in Cell-Free Massive MIMO Systems: A Hybrid Action Space-DRL Approach. IEEE Commun. Lett. 2024, 28, 2086–2090.
20. Masoudi, M.; Soroush, E.; Zander, J.; Cavdar, C. Digital Twin Assisted Risk-Aware Sleep Mode Management Using Deep Q-Networks. IEEE Trans. Veh. Technol. 2023, 72, 1224–1239.
21. Suh, H.; Kang, S.; Hwang, T. Intelligent AP Control and Computation Offloading for Cell-Free Massive MIMO MEC Networks. In Proceedings of the 2024 International Conference on Electronics, Information, and Communication (ICEIC), Taipei, Taiwan, 28–31 January 2024; pp. 1–3.
22. Wang, G.; Cheng, P.; Chen, Z.; Vucetic, B.; Li, Y. Green Cell-Free Massive MIMO: An Optimization Embedded Deep Reinforcement Learning Approach. IEEE Trans. Signal Process. 2024, 72, 2751–2766.
23. Zaher, M.; Demir, Ö.T.; Björnson, E.; Petrova, M. Learning-Based Downlink Power Allocation in Cell-Free Massive MIMO Systems. IEEE Trans. Wirel. Commun. 2023, 22, 174–188.
24. Salaün, L.; Yang, H. Deep Learning Based Power Control for Cell-Free Massive MIMO with MRT. In Proceedings of the 2021 IEEE Global Communications Conference (GLOBECOM), Madrid, Spain, 7–11 December 2021; pp. 1–7.
25. Salaün, L.; Yang, H.; Mishra, S.; Chen, C.S. A GNN Approach for Cell-Free Massive MIMO. In Proceedings of the 2022 IEEE Global Communications Conference (GLOBECOM), Rio de Janeiro, Brazil, 4–8 December 2022; pp. 3053–3058.
26. ITU-R. Guidelines for Evaluation of Radio Interface Technologies for IMT-Advanced. Report ITU-R M.2135-1. Technical Report, International Telecommunication Union, 2009. Available online: https://www.itu.int/dms_pub/itu-r/opb/rep/r-rep-m.2135-1-2009-pdf-e.pdf (accessed on 14 February 2026).
27. Chakraborty, S.; Manoj, B.R. Power Allocation in a Cell-Free MIMO System using Reinforcement Learning-Based Approach. In Proceedings of the 2023 National Conference on Communications (NCC), Guwahati, India, 23–26 February 2023; pp. 1–6.
28. Zhao, Y.; Niemegeers, I.G.; De Groot, S.H. Deep Q-network based dynamic power allocation for cell-free massive MIMO. In Proceedings of the 2021 IEEE 26th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD), Porto, Portugal, 25–27 October 2021; pp. 1–7.
29. Zhao, Y.; Niemegeers, I.G.; De Groot, S.M.H. Dynamic Power Allocation for Cell-Free Massive MIMO: Deep Reinforcement Learning Methods. IEEE Access 2021, 9, 102953–102965.
30. Bashar, M.; Akbari, A.; Cumanan, K.; Ngo, H.Q.; Burr, A.G.; Xiao, P.; Debbah, M.; Kittler, J. Exploiting Deep Learning in Limited-Fronthaul Cell-Free Massive MIMO Uplink. IEEE J. Sel. Areas Commun. 2020, 38, 1678–1697.
31. Zhang, Y.; Zhang, J.; Buzzi, S.; Xiao, H.; Ai, B. Unsupervised Deep Learning for Power Control of Cell-Free Massive MIMO Systems. IEEE Trans. Veh. Technol. 2023, 72, 9585–9590.

Figure 1. Baseline analysis in Scenario 1 illustrating the trade-off between EE, power consumption, and SE as a function of the number of active APs.

Figure 2. Evolution of the training reward as a function of the number of environment interaction steps for the considered evaluation scenarios.

Figure 3. Distribution of the EE as a function of the number of active APs observed during training for the considered evaluation scenarios.

Figure 4. Energy efficiency distribution for the different AP activation strategies under the considered evaluation scenarios.

Figure 5. Empirical CDF of the energy efficiency for the different AP activation strategies under the considered scenarios.

Figure 6. EE as a function of the number of active APs for the PPO-based policy and the random activation baseline. The blue and orange shaded areas represent the variance of the results around the mean values.

Table 1. Comparison of representative RL-based ASO approaches in cell-free massive MIMO systems.

Ref.	RL Alg.	Evaluation Scale	Power Model	Primary Focus	Key Limitations
[12]	DQN	Small-scale (8 APs)	None	QoS	Very small-scale; no EE; weak baselines
[20]	DQN	Moderate-scale	Basic	Delay, risk-aware	EE not optimized; DT-oriented framework
[13]	DDPG	Moderate-scale	Partial	Power minimization	EE not isolated; ASO not standalone
[14]	Multi-Agent DQN	Moderate-scale (50 APs)	Partial	Power minimization	No EE; limited scale; random baseline
[15]	PPO	Large-scale (~64 APs)	Basic	EE-oriented	Simplified power model; no EE-optimal AP analysis
[18]	Hierarchical DDPG	Moderate-scale	Partial	EE + clustering	Joint multi-level control; ASO not standalone
[21]	DDPG	Moderate-scale	Partial	Delay	Delay-driven; EE secondary
[22]	SAC + Graph Transformer	Moderate-scale (~40 APs)	Partial	Power minimization	EE not central; high complexity
[19]	SAC (hybrid)	Moderate-scale	None	EE + SE constraints	No power model; weak baselines
[16]	Nested Actor-Critic	Small-scale (16 APs)	Basic	EE-oriented	Very small-scale; basic power model
[17]	A2C	Small-scale (20 APs)	Basic	EE + SE constraints	Reduced scale; limited generality

Power model classification: None indicates that power consumption is not explicitly modeled. Basic refers to models that include only fixed or simplified AP circuitry and/or transmit power, without explicit fronthaul or UE-side power consumption. Partial denotes models that include AP power consumption and selected additional components, but do not capture the full network power model.

Table 2. Power consumption model parameters for downlink operation, adopted from [5].

Parameter	Value
AP power parameters
$P_{AP}^{fix, DL}$ (fixed power, active)	8.0 W
$P_{AP}^{sleep}$ (fixed power, sleep)	0.8 W
$P_{AP}^{chain, DL}$ (per RF chain, active)	0.2 W
$P_{AP}^{chain, sleep}$ (per RF chain, sleep)	0.02 W
$α_{AP}$ (PA efficiency)	0.39
UE power parameters
$P_{UE}^{fix, DL}$ (fixed reception power)	0.75 W
$ξ_{UE}$ (traffic-dependent processing)	0.25 W/Gbps
Fronthaul power parameters
$P_{FH}^{fix}$ (fixed power, active)	5.0 W
$P_{FH}^{sleep}$ (fixed power, sleep)	0.5 W
$ξ_{FH}$ (traffic-dependent power)	0.25 W/Gbps
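
To give a feel for the magnitudes in Table 2, the sketch below adds up only the fixed (load-independent) AP and fronthaul terms for a given activation level. It is an illustration under stated assumptions, not the paper's full power model: it assumes one RF chain per antenna with N = 4 antennas per AP (Table 5), assumes each AP's fronthaul link sleeps together with the AP, and omits the power-amplifier, UE, and traffic-dependent terms. The function name `static_power_w` is hypothetical.

```python
# Fixed power terms from Table 2 (watts).
N_ANTENNAS = 4  # assumption: one RF chain per antenna, N = 4 (Table 5)
P_AP_FIX_ACTIVE, P_AP_FIX_SLEEP = 8.0, 0.8
P_CHAIN_ACTIVE, P_CHAIN_SLEEP = 0.2, 0.02
P_FH_FIX_ACTIVE, P_FH_FIX_SLEEP = 5.0, 0.5


def static_power_w(num_aps, num_active):
    """Sum of fixed AP, RF-chain, and fronthaul power for a given activation level.

    Assumption: the fronthaul link of a sleeping AP is also in sleep mode.
    Transmit (PA), UE, and traffic-dependent terms are deliberately omitted.
    """
    if not 0 <= num_active <= num_aps:
        raise ValueError("num_active must lie between 0 and num_aps")
    per_active = P_AP_FIX_ACTIVE + N_ANTENNAS * P_CHAIN_ACTIVE + P_FH_FIX_ACTIVE
    per_sleep = P_AP_FIX_SLEEP + N_ANTENNAS * P_CHAIN_SLEEP + P_FH_FIX_SLEEP
    return num_active * per_active + (num_aps - num_active) * per_sleep


# Scenario 1 has 32 APs (Table 4); compare all-on with half of the APs asleep.
print(f"all-on:    {static_power_w(32, 32):.2f} W")
print(f"16 active: {static_power_w(32, 16):.2f} W")
```

Under these assumptions a sleeping AP draws one tenth of the fixed power of an active one, so the fixed part of the budget falls almost linearly with the number of deactivated APs.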

Table 3. PPO training and network configuration parameters.

Parameter	Value
Learning rate	$3 \times 10^{- 4}$
Discount factor ( $γ$ )	$0.99$
Generalized advantage estimation (GAE) parameter ( $λ$ )	$0.95$
Clipping range	$0.2$
Entropy coefficient	$0.01$
Value function coefficient	$0.5$
Max. gradient norm	$0.5$
Rollout steps ( $n_{steps}$ )	2048
Batch size	256
Epochs per update	10
Policy/value network	MLP
Hidden layers	$2 \times 256$ (ReLU)
Training device	CPU
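
For readers who want to reproduce the configuration, the values in Table 3 can be collected into a keyword-argument dictionary. The key names below follow the constructor arguments of the Stable-Baselines3 `PPO` class; this section does not state which library was used, so that mapping is an assumption, as is the single-environment rollout used in the arithmetic.

```python
# Table 3 hyperparameters, keyed by Stable-Baselines3 PPO argument names
# (assumed mapping; the library is not named in this section).
ppo_kwargs = {
    "learning_rate": 3e-4,
    "gamma": 0.99,          # discount factor
    "gae_lambda": 0.95,     # GAE parameter
    "clip_range": 0.2,
    "ent_coef": 0.01,
    "vf_coef": 0.5,
    "max_grad_norm": 0.5,
    "n_steps": 2048,        # rollout steps per update
    "batch_size": 256,
    "n_epochs": 10,
    # Two hidden layers of 256 units; a ReLU activation would additionally be
    # passed through policy_kwargs (omitted here to avoid importing torch).
    "policy_kwargs": {"net_arch": [256, 256]},
    "device": "cpu",
}

# Assuming a single environment, each rollout of n_steps transitions is split
# into minibatches and reused for n_epochs passes.
minibatches_per_epoch = ppo_kwargs["n_steps"] // ppo_kwargs["batch_size"]
gradient_steps_per_rollout = minibatches_per_epoch * ppo_kwargs["n_epochs"]

print(minibatches_per_epoch, gradient_steps_per_rollout)
```

With these settings and a single environment, every rollout yields 8 minibatches per epoch and 80 gradient steps per policy update.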

Table 4. Evaluation scenarios considered in the numerical results.

Scenario	APs (L)	UEs (K)	Area Side Length (m)
Scenario 1	32	6	500
Scenario 2	64	18	750
Scenario 3	96	30	1000
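
The deployment figures in Table 4 translate into AP densities and UE-to-AP ratios as follows. This is a small arithmetic sketch assuming a square coverage area with the listed side length; the helper name `ap_density_per_km2` is hypothetical.

```python
# Table 4 entries: number of APs (L), number of UEs (K), area side length (m).
scenarios = {
    "Scenario 1": {"aps": 32, "ues": 6, "side_m": 500},
    "Scenario 2": {"aps": 64, "ues": 18, "side_m": 750},
    "Scenario 3": {"aps": 96, "ues": 30, "side_m": 1000},
}


def ap_density_per_km2(aps, side_m):
    """APs per square kilometre, assuming a square area of the given side length."""
    area_km2 = (side_m / 1000.0) ** 2
    return aps / area_km2


summary = {
    name: (ap_density_per_km2(cfg["aps"], cfg["side_m"]), cfg["ues"] / cfg["aps"])
    for name, cfg in scenarios.items()
}

for name, (density, ue_per_ap) in summary.items():
    print(f"{name}: {density:.1f} APs/km^2, {ue_per_ap:.3f} UEs per AP")
```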

Table 5. General simulation and propagation parameters.

Parameter	Value
Deployment and general parameters
Antennas per AP (N)	4
Topology	Wrapped-around
Realizations	100
Radio parameters
Coherence block ( $τ_{c}$ )	200 symbols
Pilot length ( $τ_{p}$ )	20 symbols
Max. transmit power ( $p_{max}$ )	100 mW
Total DL power budget ( $ρ_{tot}$ )	200 mW
Bandwidth (B)	20 MHz
Noise figure	7 dB
Correlation and propagation model
Decorrelation factor	9 m
Antenna spacing	$0.5 λ$
ASD (azimuth)	$15^{\circ}$
ASD (elevation)	$15^{\circ}$
Shadow fading std. dev. ( $σ_{s f}$ )	8 dB
Carrier frequency ( $f_{c}$ )	2 GHz
AP height ( $h_{A P}$ )	10 m
UE height ( $h_{U E}$ )	1.65 m
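
Two derived quantities follow directly from Table 5: the receiver noise power and the share of each coherence block left after pilot transmission. The sketch below assumes the usual thermal noise density of −174 dBm/Hz (roughly 290 K), which is not stated in this section; how the remaining symbols are split between uplink and downlink data is also left open.

```python
import math

BANDWIDTH_HZ = 20e6       # B = 20 MHz (Table 5)
NOISE_FIGURE_DB = 7.0     # Table 5
TAU_C, TAU_P = 200, 20    # coherence block and pilot length in symbols (Table 5)
THERMAL_NOISE_DBM_PER_HZ = -174.0  # assumption: thermal noise density near 290 K


def noise_power_dbm(bandwidth_hz, noise_figure_db):
    """Receiver noise power in dBm: thermal density + 10*log10(B) + noise figure."""
    return THERMAL_NOISE_DBM_PER_HZ + 10.0 * math.log10(bandwidth_hz) + noise_figure_db


def non_pilot_fraction(tau_c, tau_p):
    """Fraction of a coherence block not spent on pilots."""
    return (tau_c - tau_p) / tau_c


noise_dbm = noise_power_dbm(BANDWIDTH_HZ, NOISE_FIGURE_DB)
data_share = non_pilot_fraction(TAU_C, TAU_P)

print(f"noise power ~ {noise_dbm:.2f} dBm")
print(f"non-pilot share of coherence block = {data_share:.2f}")
```

Under the stated assumption, the noise power is close to −94 dBm, and 90% of each coherence block remains available for data.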

