A Context-Conditioned Reinforcement Learning Framework for Space Frame Structure Optimization

Li, Yinbin; Xiao, Congzhen; Fan, Feng; Zhi, Xudong

doi:10.3390/buildings16122321

¹ Key Lab of Structures Dynamic Behavior and Control of the Ministry of Education, Harbin Institute of Technology, Harbin 150090, China

² Key Lab of Smart Prevention and Mitigation of Civil Engineering Disasters of the Ministry of Industry and Information Technology, Harbin Institute of Technology, Harbin 150090, China

³ China Academy of Building Research, Beijing 100013, China

* Author to whom correspondence should be addressed.

Buildings 2026, 16(12), 2321; https://doi.org/10.3390/buildings16122321

Submission received: 30 April 2026 / Revised: 1 June 2026 / Accepted: 7 June 2026 / Published: 10 June 2026

(This article belongs to the Special Issue Structural Design and Analysis of Buildings)

Abstract

With the rapid advancement of artificial intelligence in recent years, its adoption in structural design has been increasing. However, supervised learning in engineering design is often limited by the availability and quality of labeled data. To address this issue, this study proposes a context-conditioned deep reinforcement learning (DRL) method for the automated optimization of space frame structures, termed the Space Frame Optimization Agent (SFO-Agent). Instead of learning a policy for a single fixed design task, the proposed agent is conditioned on task-related context variables and trained through direct interaction with a finite element analysis environment, thereby avoiding reliance on pre-collected training datasets. In addition, a piecewise reward function is formulated using code-prescribed limit values to achieve both structural safety and economic efficiency. Numerical experiments demonstrate that the proposed agent effectively captures reusable design policies under varying task contexts and produces high-quality, code-compliant designs across the full parameter domain. Comparisons with conventional optimization algorithms further show its superior efficiency and competitive optimization performance, indicating strong potential for broad engineering applications.

Keywords: context-conditioned reinforcement learning; space frame structures; automated structural optimization; Soft Actor–Critic algorithm

1. Introduction

Space frame structures have been widely adopted in large public buildings due to their favorable mechanical performance and high design flexibility [1]. By connecting numerous members through joints arranged according to prescribed geometric rules, these systems form spatially distributed load-bearing structures [2]. In conventional practice, the design and optimization process of space frame structures typically comprises three major stages: shape optimization, mesh generation, and structural analysis [3,4]. However, these stages are often treated as relatively decoupled tasks, and integrated optimization frameworks that coordinate them in a unified manner remain limited. Consequently, practical optimization still requires substantial manual effort and is time-consuming. To reduce dependence on engineers’ experience, a variety of optimization algorithms have been employed for automated design of space frame structures. Erbatur et al. [5] adopted the genetic algorithm (GA) to optimize frames and truss structures, while Delyová et al. [6] applied GA to optimize member sizing and topological relationships of truss systems. GA-based methods have also been applied specifically to the design optimization of gridshell structures [7,8]. Simulated annealing and particle swarm optimization have likewise been explored for space-structure optimization [9,10,11,12]. Beyond optimization-oriented studies, related model-based computational methods have been developed for the analysis and damage identification of steel truss structures [13,14]. Although these methods have achieved promising results, limitations remain with respect to computational cost and runtime. Moreover, they are generally designed for a single instance in a one-shot manner. When the initial conditions change, the optimization process often needs to be carried out again, which may constrain their generalization capability.

In recent years, with the improvement of computational efficiency and the rapid development of machine learning techniques, an increasing number of learning-based methods have been introduced into structural engineering design, enabling efficient and accurate design of structural members [15,16,17] as well as prediction of member performance [18,19]. Machine learning methods have also been applied to the performance prediction and design of entire structural systems. de Lautour and Omenzetter [20] employed an artificial neural network (ANN) to overcome the efficiency limitations of nonlinear finite element analysis and achieved efficient prediction of seismic damage in reinforced concrete frame structures. Asgarkhani et al. [21] used 32 machine learning algorithms with hyperparameter tuning to predict the maximum residual drift ratio based on numerical simulation data from 384 steel frames, achieving an accuracy of 95%. Chang and Cheng [22] and Song et al. [23] proposed graph-based surrogate models for simple frames and elastic planar frames, providing feasible approaches for the application of machine learning to structural design and optimization. Zhao et al. [24] further employed a graph neural network (GNN) to perform the layout design of frame-structure beams. These studies indicate that machine learning methods have become capable of accomplishing certain specific engineering structural design tasks.

Driven by advances in convolutional neural networks (CNNs) [25], computers have gained the capability to recognize engineering drawings, making it feasible to use such drawings directly as training data. Pizarro et al. [26] compiled a database of 165 Chilean residential reinforced-concrete shear-wall building projects and trained a regression neural network to predict wall thickness and length from a 30-feature input vector describing the building. Huang and Zheng [27] trained a Pix2pixHD model on image data to automatically generate apartment floor plans, whereas Zheng et al. [28] trained a generative adversarial network (GAN) to generate architectural plans from input image boundaries. Liao et al. [29] further proposed a GAN-based shear-wall design method that learns from existing shear-wall design documents and enables rapid, intelligent structural design. Methodologically, these approaches are representative of supervised learning, where labeled data are used to learn an input–output mapping for prediction on unseen samples. Nevertheless, applying such approaches to the optimization of space frame structures poses several challenges: (1) the geometric forms of space frame structures are often highly complex, and drawing-based representations alone may be insufficient for accurate description; (2) supervised learning typically requires large volumes of labeled data, whereas datasets in the space frame domain are difficult to obtain and remain limited; (3) design variations across engineering projects are substantial, and predictive models tend to degrade when applied to cases outside the covered data distribution.

To address these limitations, reinforcement learning (RL) [30] offers a viable alternative. Unlike supervised learning, RL does not require pre-labeled datasets; instead, an agent improves its policy through interaction with an environment and the resulting feedback. It is widely applied to sequential decision-making and control tasks, such as robotic control, autonomous driving, and path planning. RL methods are broadly categorized into model-based [31,32] and model-free approaches. Model-free RL learns directly from data generated through interaction with the environment and is commonly categorized into two families: value learning and policy optimization. Value learning estimates a value function to evaluate states (or state–action pairs) and typically derives the policy implicitly via greedy action selection; representative algorithms include the Deep Q-Network [33] and the Dueling Deep Q-Network [34]. Policy optimization, by comparison, optimizes the policy explicitly, most commonly via gradient-based updates to improve expected return; representative algorithms include Trust Region Policy Optimization (TRPO) [35] and Proximal Policy Optimization (PPO) [36]. These two families offer complementary strengths: value learning is typically more sample-efficient but can be sensitive to estimation error, whereas policy optimization is often more stable and better suited to continuous control, but more sensitive to data and hyperparameter tuning. To combine the advantages of value learning and policy optimization, the actor–critic approach was developed [37]. It learns a policy network (actor) while simultaneously training a value estimator (critic) to evaluate and guide policy updates, thereby improving overall performance. The Soft Actor–Critic (SAC) method [38], adopted in this study to construct the agent, is a representative actor–critic method. Deep reinforcement learning (DRL) is a class of methods that integrates deep neural networks into reinforcement learning for value function approximation and policy learning.

In structural design, the finite element method (FEM) serves as the environment by evaluating a given model and returning performance indicators. Accordingly, RL has been explored for tasks such as topology optimization of planar truss structures [39,40], design of reinforced-concrete (RC) beams [41], and optimization of simplified steel frames [42]. Du et al. [43] proposed a reinforcement learning environment specifically designed for the optimal design of steel members and structures, providing a platform for agents to learn efficient structural optimization strategies. Notably, most existing studies focus on member-level optimization or rely on simplified surrogate environments, and optimization of full structural models remains scarce. Moreover, conventional RL frameworks are typically developed for a fixed design setting, which limits their ability to generalize across varying design conditions. To address task variations, conditional RL and contextual MDP formulations have been developed, in which task-related variables are incorporated into the decision process so that the learned policy can adapt to different goals, environmental parameters, or task conditions [44,45]. However, the application of such context-conditioned formulations to structural design and optimization remains limited.

Motivated by the above considerations, this study proposes a context-conditioned DRL-based optimization method for space frame structures. A DRL environment is established by coupling parametric modeling with FEM, enabling performance improvement through the closed-loop interaction between the agent and the environment. In particular, the proposed framework introduces structural design conditions as context variables, allowing the agent to learn a policy that is adaptable to multiple design scenarios rather than a single fixed task. The main contributions are as follows:

1. Based on the Soft Actor–Critic framework in DRL, the SFO-Agent is developed to interact directly with FEM and achieve automated optimization design of space frame structures through training.
2. A DRL environment is constructed by integrating parametric modeling and FEM, providing an efficient state representation and enabling coordinated optimization of structural geometry and component engineering.
3. A piecewise reward function is formulated in accordance with relevant design-code provisions for space frame structures, guiding the optimization to satisfy safety requirements while reducing material consumption.
4. A context-conditioned off-policy training strategy is adopted, in which varying design conditions are incorporated into the training process, enabling the agent to learn reusable design policies and improve generalization across multiple structural design scenarios.

2. Methodology

The proposed SFO-Agent is developed within a context-conditioned DRL framework. As illustrated in Figure 1, the design conditions of each design case are first represented parametrically and mapped into a context, which is then concatenated with the environment state to form a complete augmented state. Subsequently, the SFO-Agent, consisting of an actor and critics, receives the state as input and outputs an action. The SFO-Environment, which integrates parametric modeling and finite element analysis, then performs automated structural modeling and analysis based on the given state and action, and obtains the corresponding structural performance indicators. Compliance is evaluated according to the code-prescribed limit values, and all resulting information is stored in the replay buffer as transitions. Thereafter, the SFO-Agent draws mini-batches of samples from the replay buffer to update its networks. By continuously iterating through this cycle, the performance of the agent is progressively improved.

2.1. SFO-Environment

In RL, the environment receives actions from the agent, updates its internal configuration accordingly, and returns the newly observed state together with a reward signal. In addition to widely used general-purpose environment toolkits [46], several studies have constructed RL environments for structural engineering design [39,43]. However, these environments have largely relied on surrogate models or simplified structural representations, such as planar frames and simple frame systems. By contrast, RL environments tailored to structural engineering problems that support the analysis of full three-dimensional structural models remain scarce. To enable training of the proposed optimization agent for space frame structures, it is therefore necessary to establish an RL environment that integrates structural modeling, finite element analysis, performance-index extraction, and reward evaluation.

By integrating parametric modeling [47] with FEM, this study develops an RL environment, termed SFO-Environment, that enables automated modeling and analysis of the entire structural system. The environment allows rapid modification of the model through a set of key parameters, while the effects of the agent’s actions are evaluated and returned through finite element analysis. As shown in Figure 2, the environment takes the state and action as inputs, and the combined tensor primarily consists of two components: geometry data and simulation model data. The geometry data, which encode spatial information such as node coordinates and topology, are used to rapidly generate a geometric model. The simulation model data specify finite element analysis inputs, including cross-sections, loads, and restraint conditions. By integrating these two data streams and transferring them to the FEM software through an API, a complete FEM model can be automatically constructed. A reward function based on code-prescribed limits is then evaluated to compute the current reward, which is returned together with the next state. In the implementation, the parametric modeling module was developed using the Rhino 7.0–Grasshopper (version 1.0.0007) platform, while the finite element analysis was conducted in SAP2000 (version 21.0.2). The interaction between the parametric model and the FEM solver, including model generation, analysis execution, and result extraction, was implemented through the SAP2000 API.

2.2. State and Action

The state represents the environment condition and, for structural design, corresponds to the current set of design parameters, typically encoded as a tensor. RL training is conducted over multiple episodes. Within each episode, the agent explores the environment through a sequence of actions, after which the state is reset to start the next episode. In conventional RL frameworks, optimization is usually formulated for a fixed task, and the state is therefore reset to a fixed tensor at the beginning of each episode. In this study, however, the aim is not to optimize a single structure, but to learn a policy that generalizes across a family of design tasks within the same structural type. Inspired by the concept of contextual Markov decision processes [44], this study introduces a context variable into the state formulation to represent different design cases. As shown in Figure 3, the context is concatenated with the environment state to form an augmented state for training within each episode, while different contexts are randomly sampled from uniform distributions within their admissible ranges across episodes. In this way, the agent can learn optimization patterns under diverse design cases, thereby achieving improved generalization performance.
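The episode-level context sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `CONTEXT_RANGES`, `sample_context`, and `reset_episode` are hypothetical, the numeric ranges are placeholders rather than the admissible ranges used in the study, and the environment part of the state is simply drawn at random in normalized units.

```python
import random

# Hypothetical admissible ranges for four context variables (placeholders only).
CONTEXT_RANGES = {
    "span_m": (30.0, 80.0),
    "rise_to_span": (0.10, 0.25),
    "dead_load_kN_m2": (0.5, 2.0),
    "live_load_kN_m2": (0.3, 1.0),
}


def sample_context(rng):
    """Draw one context vector, each entry uniform within its range."""
    return [rng.uniform(lo, hi) for lo, hi in CONTEXT_RANGES.values()]


def reset_episode(rng, n_design=5):
    """Start an episode with a fresh context and a random environment state.

    The environment part is drawn in normalized [-1, 1] units (an assumption).
    """
    context = sample_context(rng)
    design = [rng.uniform(-1.0, 1.0) for _ in range(n_design)]
    return context + design  # augmented state = context ++ environment state


rng = random.Random(0)
state = reset_episode(rng)
```

The context stays fixed for the rest of the episode; only the environment part of the augmented state is changed by the agent's actions.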

In structural applications, RL often faces varying state dimensionality due to changes in the number of structural members. To address this issue, the maximum considered structure strategy was proposed [42]. It fixes the state dimension to the maximum possible member layout, and assigns empty entries to positions where members are absent, thereby ensuring a consistent input size. This approach is effective for surrogate-based models with a limited number of members. However, for near-realistic full space frame models, it leads to excessively high-dimensional states, increasing agent complexity and making training difficult to converge. This study uses a parametric representation to characterize structures within the same structural type. Specifically, the state of a given structure is represented by a small set of key parameters, rather than being encoded at the member level. As an illustrative example, the Kiewitt dome can be represented by 9 parameters, whose definitions are summarized in Table 1. A typical state transition driven by these parameters is illustrated in Figure 4. When the action $a_{t}$ is applied, the environment updates from $s_{t}$ to $s_{t + 1}$. Although the number of structural members changes, the state dimensionality remains invariant. The selected state parameters are the direct input variables of the parametric modeling program rather than post-extracted features from an existing FEM model. For each admissible state vector, the parametric modeling procedure deterministically reconstructs the corresponding structural geometry, member topology, section assignment, and load conditions, which are then transferred to the FEM software for analysis. This state-to-model mapping was verified by reconstructing FEM models from sampled state vectors and confirming that the generated models could be successfully analyzed and return the required performance indicators. However, this representation does not aim to cover the entire real design space of complex spatial structures, especially non-parametric geometries, irregular local configurations, or member-wise independent design variables. In this study, the low-dimensional parametric representation is adopted to avoid variable-length member-level states and to improve training efficiency.

2.3. Reward Design

The reward is the scalar feedback produced after the environment receives an action and evaluates the resulting new state. In this study, reward computation follows the code [48] for global-structure and member-level checks. For the Kiewitt dome case described above, the corresponding indicators and their definitions are summarized in Table 2. The optimization objective can be stated as minimizing material usage while satisfying all code requirements. Accordingly, the reward terms can be grouped into two categories: (1) criteria subject to strict code limits, such as global deformation as well as member strength and stability checks; (2) quantities without explicit code limits but desirable to reduce, such as the material usage M.

Following the code [48], the displacement of the structure under design loads is normalized by the global-structure limit of 1/400, as given in Equation (1). Also, the first-order buckling factor, which characterizes global stability, is normalized using a limit value of $BF_{lim} = 7.0$, as shown in Equation (2).

UZ_{norm} = \frac{\left| UZ / D \right|}{1 / 400}

(1)

BF_{norm} = BF / BF_{lim}

(2)

For code-limited criteria, satisfying the prescribed thresholds is sufficient; improvements beyond the limits should not further increase the reward. A clipping function defined in Equation (3) is applied to cap the reward once the requirements are met.

\mathrm{clip}(x, a, b) = \begin{cases} a & \text{if } x < a \\ x & \text{if } a \leq x \leq b \\ b & \text{if } x > b \end{cases}

(3)

\begin{array}{l} v_{UZ} = \mathrm{clip}(UZ_{norm} - 1.0, 0, 5) \\ v_{BF} = \mathrm{clip}(1.0 - BF_{norm}, 0, 5) \\ v_{Ri} = \mathrm{clip}(R_{i} - 1.0, 0, 5) \end{array}

(4)

v_{sum} = v_{UZ} + v_{BF} + \sum v_{Ri}

(5)

Equation (4) retains only the portions that exceed the code limits and quantifies the corresponding exceedance, which is then used to compute penalties. Equation (5) aggregates the exceedances from all criteria. If $v_{sum} > 0$, at least one requirement is violated; otherwise, all criteria satisfy the code limits. With a penalty shaping function defined in Equation (6), which maintains sensitivity to small exceedances and imposes stronger penalties on larger violations, the reward for code-infeasible actions can be expressed as Equation (7).

ϕ (v) = v + v^{2}

(6)

R = - ω_{UZ} ϕ (v_{UZ}) - ω_{BF} ϕ (v_{BF}) - \sum ω_{Ri} ϕ (v_{Ri})

(7)
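As a quick numerical check of Equations (4) to (7), consider a hypothetical infeasible design with $UZ_{norm} = 1.2$, $BF_{norm} = 0.9$, and all member ratios $R_{i} \leq 1.0$. These response values are illustrative only and are not results from the study; the weights are all taken as 1.5, the value reported for them in this section.

```latex
\begin{aligned}
v_{UZ} &= \mathrm{clip}(1.2 - 1.0, 0, 5) = 0.2, & \phi(v_{UZ}) &= 0.2 + 0.2^{2} = 0.24 \\
v_{BF} &= \mathrm{clip}(1.0 - 0.9, 0, 5) = 0.1, & \phi(v_{BF}) &= 0.1 + 0.1^{2} = 0.11 \\
v_{Ri} &= 0, & \phi(v_{Ri}) &= 0 \\
R &= -1.5 \times 0.24 - 1.5 \times 0.11 - 0 = -0.525
\end{aligned}
```

The displacement exceedance is twice the buckling exceedance, yet its penalty (0.36) is slightly more than twice as large (0.165), which reflects the quadratic term in the shaping function.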

When all performance criteria satisfy the code limits, the design is evaluated primarily by its material usage, which is also normalized to stabilize training. For the Kiewitt Dome, in addition to steel weight per unit area, effects such as the rise-to-span ratio and loading conditions are incorporated into the normalization, as given in Equation (8).

M_{norm} = R_{HD} \times \frac{M}{\mathrm{Area} \times (DL + LL)}

(8)

Accordingly, the reward under code-compliant conditions is given by Equation (9), while the overall reward is given by Equation (10).

R = K - ω_{M} M_{norm}

(9)

R = \begin{cases} - ω_{UZ} ϕ (v_{UZ}) - ω_{BF} ϕ (v_{BF}) - \sum ω_{Ri} ϕ (v_{Ri}) & \text{if } v_{sum} > 0 \\ K - ω_{M} M_{norm} & \text{if } v_{sum} \leq 0 \end{cases}

(10)

This piecewise reward design encourages the agent to prioritize code compliance during training, and then to pursue material reduction once feasibility is achieved. Moreover, the sign of the reward provides an intuitive indicator: because the penalty branch of Equation (10) is always negative, a positive reward can only be obtained by a design that satisfies the prescribed limits. The reward weights were selected according to a safety-priority principle. Since all code-related indicators are normalized by their corresponding limit values, $ω_{UZ}$, $ω_{BF}$, and $ω_{Ri}$ are used to balance the penalties for stiffness, global stability, and member strength/stability violations, respectively, and are all set to 1.5. The material-related weight $ω_{M}$ is set to 5.0 and controls the material-reduction term after code compliance is achieved. These weights were determined through preliminary empirical tuning guided by engineering judgment.
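The piecewise reward of Equations (3) to (10) can be written compactly in code. The sketch below is a plain-Python illustration rather than the authors' implementation. The function and argument names are hypothetical, the constant `K` is not specified in the text and is set to 1.0 purely as a placeholder, and the inputs are assumed to be already normalized as in Equations (1), (2), and (8).

```python
def clip(x, a, b):
    """Equation (3): limit x to the interval [a, b]."""
    return max(a, min(x, b))


def phi(v):
    """Equation (6): penalty shaping, linear plus quadratic."""
    return v + v * v


def reward(uz_norm, bf_norm, ratios, m_norm,
           w_uz=1.5, w_bf=1.5, w_r=1.5, w_m=5.0, K=1.0):
    """Piecewise reward of Equation (10).

    K=1.0 is a placeholder assumption; the weights follow the text.
    """
    # Equation (4): exceedance of each code-limited criterion.
    v_uz = clip(uz_norm - 1.0, 0.0, 5.0)
    v_bf = clip(1.0 - bf_norm, 0.0, 5.0)
    v_r = [clip(r - 1.0, 0.0, 5.0) for r in ratios]
    # Equation (5): total exceedance decides which branch applies.
    v_sum = v_uz + v_bf + sum(v_r)
    if v_sum > 0:
        # Equation (7): penalty for code-infeasible designs.
        return -(w_uz * phi(v_uz) + w_bf * phi(v_bf)
                 + sum(w_r * phi(v) for v in v_r))
    # Equation (9): material-driven reward for code-compliant designs.
    return K - w_m * m_norm


# Illustrative inputs only (not results from the study).
infeasible = reward(1.0, 1.0, [1.5], m_norm=0.12)       # one member over its limit
feasible = reward(0.7, 1.3, [0.8, 0.95], m_norm=0.12)   # all checks satisfied
```

With these inputs the first call falls in the penalty branch and returns a negative value, while the second depends only on the normalized material usage.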

In the proposed FEM–RL interaction, the reward is computed from deterministic FEM outputs rather than from a stochastic surrogate model. For the same state–action pair, fixed modeling procedures and solver settings generate consistent analysis results; therefore, reward uncertainty is mainly limited to minor numerical errors caused by rounding and solution tolerances, which have only a limited influence on the reward evaluation in this study.

2.4. SFO-Agent Design

The agent built in this study is not developed based on the more commonly used TRPO [35] or PPO [36] frameworks. The primary reason is that these methods are on-policy algorithms, for which the data collected in a given episode are discarded immediately after use, resulting in relatively low sample efficiency. In structural design tasks, however, finite element analysis is computationally expensive, and the introduction of context variables into the state to represent different design tasks further reduces repeated visits to the same state, thereby increasing the variance of learning. Inspired by the Soft Actor–Critic (SAC) [38] framework, the proposed agent, termed SFO-Agent, is formulated as a maximum-entropy RL framework with off-policy training. All collected data are stored in a replay buffer and repeatedly sampled for training, which greatly improves sample efficiency. In addition, maximum-entropy RL incorporates the entropy term in Equation (11) into both the policy objective and the value function, thereby promoting more diverse exploration and reducing the risk of convergence to a local optimum.

H(X) = \mathbb{E}_{x \sim p} [- \log p(x)]

(11)
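To make Equation (11) concrete, the sketch below evaluates the entropy of a one-dimensional Gaussian in two ways: with the standard closed form $\frac{1}{2}\ln(2πeσ^{2})$ and with a Monte Carlo average of $-\log p(x)$. The Gaussian is an assumption chosen for illustration (the tanh squashing of a SAC-style actor is ignored), and all function names are hypothetical.

```python
import math
import random


def gaussian_entropy(sigma):
    """Closed-form differential entropy of N(mu, sigma^2), in nats."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


def log_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2.0 * sigma ** 2))


def entropy_estimate(mu, sigma, n=20000, seed=0):
    """Monte Carlo estimate of E_{x~p}[-log p(x)] from n samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu, sigma)
        total -= log_pdf(x, mu, sigma)
    return total / n


narrow = gaussian_entropy(0.1)   # concentrated policy: low entropy
wide = gaussian_entropy(1.0)     # spread-out policy: higher entropy
estimate = entropy_estimate(0.0, 1.0)
```

A wider action distribution has higher entropy, so adding the entropy term to the objective rewards policies that keep exploring instead of collapsing early onto a single action.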

As summarized in Figure 5, the SFO-Agent comprises five deep neural networks (DNNs): an actor implemented by a policy network $π (a_{t} | s_{t})$, and critics implemented by two Q-networks $Q_{1}$ and $Q_{2}$ together with their corresponding target networks $q_{1}$ and $q_{2}$. Data flow and parameter updates across these networks can be summarized into four parts:

1. Actor–environment interaction: Given the current state $s_{t}$, the policy outputs an action $a_{t}$, which is applied to the SFO-Environment to obtain the reward $r_{t}$ and the next state $s_{t + 1}$. The transition tuple $\langle s_{t}, a_{t}, r_{t}, s_{t + 1} \rangle$ is then stored in the replay buffer for subsequent training.
2. Critic update: Given $s_{t + 1}$, the actor outputs the next action $a_{t + 1}$ and the entropy-related term $\log (π (a_{t + 1} | s_{t + 1}))$. Concatenating $s_{t + 1}$ and $a_{t + 1}$ forms the input to the target critics, which produce two target Q-values, $q_{1}$ and $q_{2}$. Similarly, concatenating the sampled $s_{t}$ and $a_{t}$ provides the input to the critics, yielding the action-value estimates $Q_{1}$ and $Q_{2}$. The target temporal difference and temporal difference error used for network updates are given in Equations (12) and (13).

TD_{target} = r_{t} + γ (\min (q_{1}, q_{2}) - α \log (π (a_{t + 1} | s_{t + 1})))

(12)

TD_{error, i} = \frac{1}{2} {(Q_{i} - TD_{target})}^{2}

(13)

3. Soft update of the target critics: The target critics share the same network architecture as the critics, and this study adopts a soft update controlled by $τ$, which incrementally moves the target parameters toward the critic parameters, as given in Equation (14), where $ω_{i}^{'}$ and $ω_{i}$ denote the parameters of the target critics and critics, respectively.

{ω_{i}}^{'} \leftarrow τ ω_{i} + (1 - τ) {ω_{i}}^{'}

(14)

4. Actor update: The actor is updated under the guidance of the critics to favor higher-value actions. For sampled states $s_{t}$, the actor produces a new action $a_{t}^{'}$, which is evaluated by the critics to obtain $Q_{1}^{'}$ and $Q_{2}^{'}$. The actor objective encourages both high action value and high entropy, leading to the loss function in Equation (15).

\mathrm{loss} = α \log (π (a_{t}^{'} | s_{t})) - \min ({Q_{1}}^{'}, {Q_{2}}^{'})

(15)
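The update quantities in Equations (12) to (15) can be sketched for a single transition as follows. This is a scalar, framework-free illustration and not the authors' code: a real agent evaluates these expressions on mini-batches of network outputs and backpropagates through them. The defaults for γ and τ follow the values reported in this paper, whereas `alpha=0.2` is only a placeholder assumption, since the entropy coefficient is adapted during training.

```python
def td_target(r, q1_next, q2_next, logp_next, gamma=0.99, alpha=0.2):
    """Equation (12): soft TD target from the two target critics."""
    return r + gamma * (min(q1_next, q2_next) - alpha * logp_next)


def td_error(q, target):
    """Equation (13): squared TD error for one critic."""
    return 0.5 * (q - target) ** 2


def soft_update(target_params, params, tau=0.003):
    """Equation (14): move each target parameter a fraction tau toward the critic."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(params, target_params)]


def actor_loss(logp_new, q1_new, q2_new, alpha=0.2):
    """Equation (15): entropy-regularized actor loss for one sampled state."""
    return alpha * logp_new - min(q1_new, q2_new)


# Illustrative numbers only.
y = td_target(r=0.5, q1_next=1.0, q2_next=1.2, logp_next=-1.0)
critic_loss_1 = td_error(1.5, y)
new_target = soft_update(target_params=[0.0, 1.0], params=[1.0, 1.0])
loss_pi = actor_loss(logp_new=-1.0, q1_new=1.0, q2_new=1.2)
```

Taking the minimum of the two critics in both the target and the actor loss is the usual guard against overestimated action values, and the small τ keeps the target networks changing slowly relative to the critics.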

By incorporating Prioritized Experience Replay [49] and adaptive entropy regularization into the training pipeline, the overall training procedure can be summarized by the pseudocode in Appendix A.
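For readers unfamiliar with Prioritized Experience Replay, the sketch below shows the proportional variant in its simplest list-based form. It is an assumption that this variant matches the one used in the study; the exponents `alpha` and `beta`, the offset `eps`, and all function names are illustrative placeholders, and practical implementations usually rely on a sum-tree for efficiency.

```python
import random


def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities P(i) proportional to (|delta_i| + eps)^alpha."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, beta=0.4):
    """Importance-sampling weights (N * P(i))^(-beta), scaled so the max is 1."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    top = max(raw)
    return [w / top for w in raw]


def sample_indices(probs, batch_size, rng):
    """Draw a mini-batch of buffer indices according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


# Illustrative TD errors for a three-transition buffer.
probs = per_probabilities([0.1, 0.5, 2.0])
weights = importance_weights(probs)
batch = sample_indices(probs, batch_size=4, rng=random.Random(0))
```

Transitions with large TD errors are replayed more often, and the importance weights scale down their loss contribution to compensate for the non-uniform sampling.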

3. Experiments and Results

3.1. Computational Setup

To validate the effectiveness of the proposed SFO-Agent, a representative benchmark is selected for agent construction and training. The single-layer Kiewitt dome exhibits relatively uniform force distribution and high material efficiency [50] and has been widely used in practice. It is also among the commonly recommended space frame structural forms in the code [48]. Therefore, a single-layer Kiewitt dome shown in Figure 6 is adopted as the test case to evaluate the proposed agent.

Following the procedure described in Section 2.4, the components of the SFO-Agent are implemented and coupled via off-policy training. A parametric model is first developed for the Kiewitt dome, whose state space is defined by 9 parameters; their definitions and admissible ranges are provided in Table 3. Cross-section candidates are taken from the commonly used rectangular tube sections specified in the code [51], yielding 75 sections indexed from 0 to 74 in ascending size. Prior to being fed into the networks, all state parameters are linearly scaled and normalized to [−1, 1]. This linear rescaling is used only to transform the bounded actor output into feasible design variables and does not assume a linear relationship between the design variables and structural performance. The nonlinear effects of mesh size and section selection on stiffness, stability, stress ratio, and material usage are evaluated by the FEM analysis and reflected in the reward signal during training.

The first four dimensions of the state are treated as context variables to represent different design cases. The context variables were sampled within the admissible ranges listed in Table 3, which were selected to cover representative engineering design conditions of single-layer space frame structures, including different span scales, rise-to-span ratios, and roof load levels. The action is defined by the last five dimensions of the state, forming the tensor $[l_{1}, R_{l1}, S_{1}, S_{2}, S_{3}]$, which specifies the adjustments applied to the Kiewitt dome mesh sizing and the cross-section of the member groups. Similar to state normalization, the actor outputs actions through a tanh activation to constrain them within [−1, 1], followed by dimension-wise rescaling to recover the corresponding physical values.
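The linear scaling between normalized values in [−1, 1] and physical design variables can be sketched as follows. The helper names are hypothetical, and rounding to the nearest section index is an assumption about how a continuous actor output is mapped to the 75 discrete sections indexed from 0 to 74.

```python
def rescale(u, lo, hi):
    """Map u in [-1, 1] linearly onto the physical range [lo, hi]."""
    return lo + 0.5 * (u + 1.0) * (hi - lo)


def normalise(x, lo, hi):
    """Inverse map: physical value in [lo, hi] to [-1, 1]."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0


def section_index(u, n_sections=75):
    """Map u in [-1, 1] to a discrete section index in 0..n_sections-1.

    Rounding to the nearest index is an assumption.
    """
    idx = round(rescale(u, 0, n_sections - 1))
    return max(0, min(n_sections - 1, idx))


smallest = section_index(-1.0)
middle = section_index(0.0)
largest = section_index(1.0)
```

Because the sections are indexed in ascending size, a larger actor output selects a heavier section for that member group.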

The reward follows the criteria defined in Section 2.3, aiming to guide the agent toward designs that satisfy the code [48] requirements on strength, stiffness, and global stability. The SFO-Agent comprises five DNNs, and the network hyperparameters are summarized in Table 4. For training, the number of episodes is set to 600, and the number of trial steps T per episode is 4, meaning that each randomly initialized structure is adjusted four times. The replay buffer warm-up size is 400; network updates start only after more than 400 transitions have been stored. The batch size for sampling from the replay buffer is 256. The target-critic soft update coefficient is $τ = 0.003$, the discount factor is $γ = 0.99$, and the target entropy is set to −5.

3.2. Training Results

After 600 training episodes (approximately 12 h), the agent’s performance improves substantially and reaches a stable level. The episode-averaged reward is smoothed using a moving average with a window size of 3 and plotted in Figure 7. According to the reward definition, the reward becomes negative in two situations: when any code requirement is violated, or when the design is code-compliant but exhibits excessive material usage. Both are regarded as undesirable solutions; therefore, reward = 0 is adopted as the code threshold for evaluation, as indicated by the red dashed line in Figure 7.

The off-policy training procedure consists of two stages. Before the replay buffer reaches its minimum size, the agent only interacts with the environment to collect sufficient transitions, and no network updates are performed; this is the replay buffer warm-up stage. The second stage is the gradient-updating phase, during which the networks are updated using minibatches sampled from the replay buffer while newly generated transitions are continuously appended.
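The two-stage schedule can be checked with a small counting loop. The sketch uses the episode count, steps per episode, and warm-up size reported in Section 3.1; it assumes one gradient update per environment step once the warm-up is over and a buffer that never evicts transitions, neither of which is stated in the text. The function name is hypothetical.

```python
def run_schedule(n_episodes=600, steps_per_episode=4, warmup=400):
    """Count gradient updates under a warm-up-gated off-policy schedule."""
    buffer_size = 0
    n_updates = 0
    first_update_episode = None
    for episode in range(1, n_episodes + 1):
        for _ in range(steps_per_episode):
            buffer_size += 1              # store one transition from the environment
            if buffer_size > warmup:      # gradient-updating phase
                n_updates += 1            # assumption: one update per environment step
                if first_update_episode is None:
                    first_update_episode = episode
    return buffer_size, n_updates, first_update_episode


total_transitions, total_updates, first_episode = run_schedule()
```

Under these assumptions the run collects 2400 transitions, and the first update occurs in episode 101, that is, after 100 full warm-up episodes.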

Figure 7 shows that during the replay buffer warm-up stage, the untrained agent generates low-quality designs, and almost none satisfy the code requirements. After approximately 100 episodes, training enters the gradient-updating phase, where the networks begin learning from replayed data. A clear performance improvement emerges around episode 130, and by about episode 200 the agent is able to generate code-compliant designs. Thereafter, it further optimizes material usage to obtain higher rewards. From episodes 350 to 600, the mean reward fluctuates at a consistently high level and remains well above the code threshold.

To evaluate the training stability and reproducibility of the proposed method, five independent training runs were conducted using different random seeds while keeping the network architecture, hyperparameters, reward function, and FEM environment unchanged. Figure 8 shows the mean episode reward over the five runs, with the shaded region indicating one standard deviation. In the later training stage, the reward remains mostly above zero, while the reduced standard deviation indicates that different random-seed runs converge to a similar high-reward region. These results demonstrate that the proposed training process is reproducible and reasonably stable despite the stochastic nature of RL training.
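A mean curve with a one-standard-deviation band, as plotted for the five seeds, can be computed along the following lines. This is a sketch under assumptions: the text does not say whether the population or the sample standard deviation is used (the population form is used here), and the two short curves are made-up numbers, not the authors' results.

```python
from statistics import mean, pstdev


def seed_band(curves):
    """Per-episode mean and +/- 1 standard deviation across seeds.

    curves: list of per-seed reward lists, all of equal length.
    """
    means, stds = [], []
    for per_episode in zip(*curves):
        means.append(mean(per_episode))
        stds.append(pstdev(per_episode))  # population std (an assumption)
    lower = [m - s for m, s in zip(means, stds)]
    upper = [m + s for m, s in zip(means, stds)]
    return means, lower, upper


# Two hypothetical seeds with two episodes each, for illustration only.
curves = [[1.0, 2.0], [3.0, 2.0]]
means, lower, upper = seed_band(curves)
print(means, lower, upper)
```

A narrowing band in later episodes corresponds to the reduced standard deviation noted above.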

Compared with conventional DRL frameworks, the mean reward of the SFO-Agent still exhibits noticeable fluctuations in the late training stage. This behavior is expected because the proposed context-conditioned framework optimizes a family of structures rather than a single fixed instance. Nevertheless, the trained agent produces designs with rewards predominantly above zero, indicating that it has learned effective layout patterns for space frame structures and can generate safe and material-efficient solutions under varying initial conditions, with favorable generalization capability.

3.3. Typical Case Analysis

In this section, the trained SFO-Agent is applied to a representative design case and the resulting solutions are discussed. A Kiewitt dome with fixed span, rise-to-span ratio, and loading is selected for analysis. The case considers a 50 m span, a rise-to-span ratio of 1/7, and dead and live loads of 1.5 kN/m² and 0.5 kN/m², respectively; the initial model is shown in Figure 9. The trained agent performs a 10-step decision sequence for this case. After each step, the structural responses are evaluated and the reward is computed, and the design with the highest reward among the 10 steps is selected as the final solution. The stepwise actions and corresponding design results are reported in Table 5.

Across the design process, the agent selects each action based on the previous state to seek higher-reward solutions. As shown in Table 5, all 10 steps satisfy the code limits. The best solution is obtained at Step 4 with a reward of 12.44, corresponding to $l_{1} = 4.4$ m and $R_{l 1} = 1.4$. The selected section indices for radial, circumferential, and diagonal members are 51, 29, and 28, respectively, corresponding to rectangular tube sections 400 × 200 × 8, 200 × 150 × 5 and 220 × 140 × 4, with a total material usage of 42.67 t.
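The best-of-10 selection rule can be reproduced directly from the values in Table 5. The sketch below only replays the tabulated results; it does not call the agent or the FEM solver, and the tuple layout is an illustrative choice.

```python
# (step, l1 in mm, R_l1, S1, S2, S3, reward, material usage in t), from Table 5
TABLE_5 = [
    (1, 3600, 1.2, 37, 20, 33, 9.30, 59.93),
    (2, 4400, 1.2, 56, 33, 32, 10.20, 54.96),
    (3, 3600, 1.4, 46, 24, 29, 10.57, 52.93),
    (4, 4400, 1.4, 51, 29, 28, 12.44, 42.67),
    (5, 3900, 1.5, 52, 27, 27, 10.03, 55.92),
    (6, 4200, 1.5, 49, 29, 26, 10.28, 54.55),
    (7, 4100, 1.4, 52, 27, 26, 9.43, 59.22),
    (8, 4200, 1.4, 52, 27, 26, 8.95, 61.83),
    (9, 4000, 1.5, 49, 26, 28, 10.65, 52.51),
    (10, 4300, 1.5, 52, 29, 27, 10.08, 56.64),
]

# The final design is the step with the highest reward.
best = max(TABLE_5, key=lambda row: row[6])
best_step, best_reward, best_material = best[0], best[6], best[7]
print(best_step, best_reward, best_material)
```

In this case the highest-reward step is also the one with the lowest material usage.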

The associated finite element results are shown in Figure 10: the first-order buckling factor is 10.46 (above the 7.0 limit), the maximum vertical deflection is −19.45 mm, giving a deflection-to-span ratio of 1/2570 (well below 1/400), and the maximum member stress ratio is 0.77. Overall, the design approaches but does not exceed the code limits, indicating a material-efficient solution. Notably, the agent does not merely refine designs near the best-scoring region; it also explores distant but competitive designs such as Step 3, which adopts a denser mesh with smaller sections and remains code-compliant, albeit with higher steel usage than Step 4 (52.93 t versus 42.67 t). This behavior suggests that the entropy-regularized policy promotes broader exploration and helps avoid poor local optima.
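The three checks reported for this design can be written out as a short calculation. The buckling limit of 7.0 and the deflection limit of 1/400 are the values quoted above; the stress-ratio limit of 1.0 is an assumption made here for illustration, and `check_design` is a hypothetical helper rather than part of the SFO-Environment.

```python
def check_design(span_mm, deflection_mm, buckling_factor, max_stress_ratio,
                 deflection_limit=1 / 400, buckling_limit=7.0,
                 stress_limit=1.0):
    """Evaluate stiffness, stability, and strength against the given limits."""
    deflection_ratio = abs(deflection_mm) / span_mm
    return {
        # Denominator n of the deflection-to-span ratio written as 1/n.
        "ratio_denominator": span_mm / abs(deflection_mm),
        "stiffness_ok": deflection_ratio <= deflection_limit,
        "stability_ok": buckling_factor >= buckling_limit,
        "strength_ok": max_stress_ratio <= stress_limit,  # assumed limit of 1.0
    }


# Step 4 of the typical case: 50 m span, values as reported with Figure 10.
result = check_design(span_mm=50_000, deflection_mm=-19.45,
                      buckling_factor=10.46, max_stress_ratio=0.77)
print(result)
```

For a 50 m span, the 1/400 limit corresponds to an allowable deflection of 125 mm, so the reported 19.45 mm leaves a wide margin.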

4. Discussion of Agent Performance

4.1. Agent Evaluation

To further evaluate the agent, 100 design cases are randomly generated and tested. For each case, the highest-reward design among the 10 steps is taken as the final outcome. The results for all 100 cases are summarized in Figure 11.

Across the 100 randomly generated test cases, the agent achieves a mean reward of 10.16, and all cases yield rewards greater than zero, indicating code-compliant designs with relatively low material usage. This further suggests that the SFO-Agent learns effective design strategies through interaction with the finite element solver and can rapidly generate solutions under varied design conditions. Nevertheless, several cases exhibit noticeably lower rewards, such as Case 5 and Case 56 with rewards of 1.76 and 1.84, corresponding to states [69,000, 0.21, 0.0006, 0.0006, 3900, 1.0, 28, 21, 19] and [50,000, 0.16, 0.0003, 0.0006, 4200, 0.9, 23, 16, 22], respectively. A common feature is relatively small loads. Under such conditions, the design is no longer controlled by the stress ratio but by the slenderness limit, leading to low load-carrying efficiency. Thus, the low rewards are primarily attributable to the specific design conditions. Even under relatively demanding constraints, the agent still produces code-compliant designs, albeit with lower rewards.
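The nine-entry state vectors quoted above follow the parameter order of Table 1. A small sketch makes the two low-reward cases easier to read and converts the loads from N/mm² to kN/m²; the `DesignState` type and the helper function are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class DesignState(NamedTuple):
    """State layout following Table 1 (field names are illustrative)."""
    D: float      # span diameter, mm
    R_HD: float   # rise-to-span ratio
    DL: float     # dead load, N/mm^2
    LL: float     # live load, N/mm^2
    l1: float     # radial grid size, mm
    R_l1: float   # ratio of circumferential to radial grid size
    S1: int       # section index of radial members
    S2: int       # section index of circumferential members
    S3: int       # section index of diagonal members


def to_kn_per_m2(pressure_n_per_mm2):
    # 1 N/mm^2 = 1e6 N/m^2 = 1000 kN/m^2
    return pressure_n_per_mm2 * 1000.0


case_5 = DesignState(69_000, 0.21, 0.0006, 0.0006, 3900, 1.0, 28, 21, 19)
case_56 = DesignState(50_000, 0.16, 0.0003, 0.0006, 4200, 0.9, 23, 16, 22)

for name, s in (("Case 5", case_5), ("Case 56", case_56)):
    total = to_kn_per_m2(s.DL + s.LL)
    print(f"{name}: D = {s.D / 1000:.0f} m, DL + LL = {total:.1f} kN/m^2")
```

Both cases sit near the lower end of the load ranges in Table 3 (0.3 to 1.5 kN/m² for dead load and 0.5 to 1.5 kN/m² for live load), which is consistent with the small-load explanation.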

The trained agent is evaluated on unseen Kiewitt dome cases with different spans, rise-to-span ratios, and loading conditions within the predefined parameter ranges. This is the generalization capability discussed in this study, namely the agent’s adaptability to different contexts within the same parametric structural type. It should be distinguished from topological generalization across different structural systems, which is not the focus of the present study.

The above results indicate that the agent can yield lower rewards under certain design conditions, leading to performance variability. To further assess its behavior across the full parameter domains, additional samples are generated by randomly sampling design conditions while sweeping each context variable over its entire admissible range.

For each context dimension, 20 samples are generated, yielding 20 reward curves per dimension; the corresponding mean curve is also plotted, as shown in Figure 12. Overall, the agent consistently produces code-compliant designs, with no violations observed. The rewards are relatively stable and remain at high levels with respect to structural diameter and rise-to-span ratio. In contrast, higher dispersion is observed along the dead-load and live-load dimensions: rewards are lower at small load levels and tend to increase as the loads increase. This trend is attributed to a shift in governing constraints from detailing-driven requirements at low loads to stress-governed design at higher loads, which improves load-carrying efficiency and thus increases the reward. These results indicate that a well-trained agent achieves robust performance across the full parameter domains and satisfies typical design requirements for space frame structures.
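One possible reading of this sweep is sketched below: for a chosen context dimension, each of the 20 curves fixes the other three context variables at random values and steps the swept variable across its grid from Table 3. The exact sampling scheme is not specified in the text, so the function names, the seeding, and the treatment of the non-swept variables are assumptions.

```python
import random

# (min, max, resolution) of the four context variables, taken from Table 3
CONTEXT_RANGES = {
    "D": (20_000, 80_000, 1000),      # span diameter, mm
    "R_HD": (0.10, 0.25, 0.01),       # rise-to-span ratio
    "DL": (0.0003, 0.0015, 0.0001),   # dead load, N/mm^2
    "LL": (0.0005, 0.0015, 0.0001),   # live load, N/mm^2
}


def grid(lo, hi, step):
    """All admissible values between lo and hi at the given resolution."""
    n = round((hi - lo) / step)
    return [round(lo + i * step, 6) for i in range(n + 1)]


def sweep_contexts(swept, n_curves=20, seed=0):
    """Build n_curves sweeps of one context variable (assumed scheme)."""
    rng = random.Random(seed)
    curves = []
    for _ in range(n_curves):
        # Randomly fix the context variables that are not being swept.
        fixed = {name: rng.choice(grid(*bounds))
                 for name, bounds in CONTEXT_RANGES.items() if name != swept}
        curves.append([{**fixed, swept: value}
                       for value in grid(*CONTEXT_RANGES[swept])])
    return curves


diameter_curves = sweep_contexts("D")
print(len(diameter_curves), len(diameter_curves[0]))
```

Each context in a curve would then be passed to the trained agent, and the best reward of its 10-step sequence would give one point on that curve.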

4.2. Algorithm Comparison

To justify the selection of SAC, a comparative ablation study was conducted using PPO as a representative on-policy baseline. The PPO agent was trained under the same SFO-Environment, state–action definition, reward function, parameter ranges, and finite-element interaction budget as SAC. As shown in Figure 13, both algorithms show an increasing reward trend during training, but SAC reaches the code-compliant reward region earlier and maintains a higher reward level in the later training stage. By contrast, PPO exhibits slower reward improvement and larger fluctuations, with several drops below the code threshold even in the later stage. This comparison indicates that SAC provides better sample efficiency and training stability for the proposed FEM-interaction-based structural optimization task, mainly due to its off-policy replay-buffer reuse and entropy-regularized exploration mechanism.

Another way to assess the performance of the SFO-Agent is to compare it with a conventional optimization algorithm. Three representative Kiewitt single-layer dome cases are selected and designed using both a genetic algorithm (GA) and the trained agent. The case parameters are summarized in Table 6, corresponding to spans of 25 m (short), 50 m (medium), and 70 m (large).

The GA uses tournament selection with a tournament size of 4, a crossover rate of 0.9, and a mutation rate of 0.2, with a population size of 40. Fitness is evaluated using the same reward function as SFO-Agent. The search is terminated when the best fitness in the population remains unchanged for three consecutive generations, indicating convergence.
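A compact GA with the settings listed above might look like the following sketch. The fitness function is a toy stand-in for the FEM-based reward, the target vector is arbitrary, and the uniform crossover, per-gene mutation, and absence of elitism are assumptions; the stopping rule tracks the best fitness found so far, which is one interpretation of the stated criterion.

```python
import random

# Index ranges for l1, R_l1, S1, S2, S3 implied by Table 3
# (41 grid sizes, 11 grid ratios, 75 section indices per member group).
BOUNDS = [(0, 40), (0, 10), (0, 74), (0, 74), (0, 74)]


def toy_fitness(genes):
    # Hypothetical stand-in for the reward; the target is arbitrary.
    target = (20, 5, 40, 30, 20)
    return -sum(abs(g - t) for g, t in zip(genes, target))


def tournament(pop, fits, rng, k=4):
    """Return the fittest of k randomly drawn individuals."""
    contenders = rng.sample(range(len(pop)), k)
    return pop[max(contenders, key=lambda i: fits[i])]


def run_ga(fitness, pop_size=40, cx_rate=0.9, mut_rate=0.2, patience=3,
           seed=0, max_gen=500):
    rng = random.Random(seed)
    pop = [[rng.randint(lo, hi) for lo, hi in BOUNDS] for _ in range(pop_size)]
    best, best_fit, stall, generations = None, float("-inf"), 0, 0
    # Stop after `patience` consecutive generations without improvement.
    while stall < patience and generations < max_gen:
        fits = [fitness(ind) for ind in pop]
        gen_best = max(range(pop_size), key=lambda i: fits[i])
        if fits[gen_best] > best_fit:
            best, best_fit, stall = list(pop[gen_best]), fits[gen_best], 0
        else:
            stall += 1
        children = []
        while len(children) < pop_size:
            a = tournament(pop, fits, rng)
            b = tournament(pop, fits, rng)
            child = list(a)
            if rng.random() < cx_rate:  # uniform crossover (assumed operator)
                child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            for i, (lo, hi) in enumerate(BOUNDS):
                if rng.random() < mut_rate:  # per-gene random reset (assumed)
                    child[i] = rng.randint(lo, hi)
            children.append(child)
        pop = children
        generations += 1
    return best, best_fit, generations


best, best_fit, generations = run_ga(toy_fitness)
print(best, best_fit, generations)
```

In the actual comparison every fitness evaluation requires a finite element analysis, so the number of evaluations (population size times generations) dominates the GA runtime.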

For the three cases, the runtimes and key outcomes of GA and the SFO-Agent are summarized in Table 7. The GA is a case-specific iterative search method that must restart the optimization process whenever the design conditions change. In contrast, the proposed context-conditioned DRL framework requires a relatively long training stage (approximately 12 h), but the trained agent can be repeatedly used for new design cases with only a small number of trial evaluations. As a result, during the inference stage, the GA requires roughly 32–43 times the design time of the SFO-Agent (Table 7), and its computational cost increases more markedly with model scale.
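The runtime ratios and a rough break-even point for the training cost can be derived from Table 7. The break-even estimate is an illustration added here, not a result reported by the authors: it assumes the three cases are representative of a typical workload and takes the training time as exactly 12 h.

```python
# case: (GA time in s, SFO-Agent time in s), copied from Table 7
TABLE_7_TIMES = {
    "Case 1": (2036, 57),
    "Case 2": (3296, 102),
    "Case 3": (5862, 137),
}
TRAINING_SECONDS = 12 * 3600  # "approximately 12 h" of training

# Per-case ratio of GA design time to agent inference time.
speedups = {case: ga / agent for case, (ga, agent) in TABLE_7_TIMES.items()}

# Average time saved per design case once the agent is trained.
mean_saving = (sum(ga - agent for ga, agent in TABLE_7_TIMES.values())
               / len(TABLE_7_TIMES))

# Number of design cases after which training time is recovered.
break_even_cases = TRAINING_SECONDS / mean_saving

for case, ratio in speedups.items():
    print(f"{case}: GA takes {ratio:.1f}x the agent's time")
print(f"break-even after about {break_even_cases:.1f} cases")
```

Under these assumptions the one-off training cost is recovered after roughly a dozen design cases, which supports the pretraining-and-reuse argument for batches of similar tasks.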

In terms of design quality, the two methods are comparable overall, and both achieve code compliance with relatively high rewards. For the short-span Case 1 and the medium-span Case 2, the rewards are close (14.54 versus 13.60 and 12.29 versus 12.44 for GA and SFO-Agent, respectively). For the large-span Case 3, where the structural complexity increases with more members, SFO-Agent outperforms GA by a clear margin: its design uses 12.24 t less material (73.82 t versus 86.06 t), a reduction of approximately 14.2% relative to the GA design.
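The material saving for Case 3 can be checked against Table 7. Because the percentage depends on which design is taken as the baseline, the sketch below evaluates both options; the variable names are illustrative.

```python
# Material usage for Case 3 in tonnes, copied from Table 7
ga_mass_t = 86.06
agent_mass_t = 73.82

saving_t = ga_mass_t - agent_mass_t

# Saving expressed relative to the GA design (the usual sense of "reduction").
relative_to_ga = saving_t / ga_mass_t * 100

# Saving expressed relative to the SFO-Agent design.
relative_to_agent = saving_t / agent_mass_t * 100

print(f"saving: {saving_t:.2f} t")
print(f"relative to GA design: {relative_to_ga:.2f}%")
print(f"relative to SFO-Agent design: {relative_to_agent:.2f}%")
```

The 12.24 t saving is about 14.2% of the GA design's mass and about 16.6% of the SFO-Agent design's mass.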

Overall, both SFO-Agent and the conventional GA demonstrate strong performance in optimizing space frame structures, achieving material reduction while satisfying code requirements. By incorporating action entropy, SFO-Agent promotes broader exploration and can deliver superior solutions in more challenging cases. Moreover, the pretraining-and-reuse paradigm is substantially more efficient than GA for large batches of similar design tasks, leading to significant savings in design time.

4.3. Extension to Other Structural Types

To further validate the applicability of SFO-Agent to space frame structural optimization, the proposed approach is extended to three additional common structural forms: single-layer lamella cylindrical reticulated shell (SL-LCRS), single-layer hyperbolic paraboloid reticulated shell (SL-HPRS), and single-layer sunflower-type spherical reticulated shell (SL-SSRS). The corresponding structural layouts are shown in Figure 14.

Following the proposed methodology, a corresponding parametric environment and SFO-Agent are constructed and trained separately for each new structural type, because different structural forms may have different geometric generation rules, state–action mappings, and design-variable meanings. After tuning hyperparameters such as learning rates, each agent is trained via off-policy training and evaluated on 100 randomly generated design cases. The training average reward curves and test results are presented in Figure 15, from which the following observations can be made.

Across the three structural forms, the SFO-Agents show rapid learning progress, with a marked improvement emerging around 150–200 episodes. After approximately 250 episodes, the mean reward remains consistently above the code threshold and gradually stabilizes. In the 100-case randomized evaluation, the agents achieve mean rewards of 11.11, 18.47, and 10.92, respectively, indicating that the learned policies can reliably generate code-compliant and material-efficient designs.

The results also show that the learning difficulty varies among different structural types. This difference is mainly related to the geometric generation rules, force-transfer mechanisms, and feasible design-space sizes of the corresponding structures. When the structural response changes smoothly with respect to mesh and section variables, the agent can more easily learn stable optimization patterns. By contrast, structures with stronger sensitivity to geometric or section changes may lead to a less smooth reward landscape and therefore require more exploration during training.

Notably, a small number of low-reward cases are observed, and the SL-HPRS tests include a few cases below the code threshold, suggesting occasional failures under specific design conditions. However, because RL operates through stepwise interaction rather than one-shot prediction, such low-quality outcomes are immediately revealed by environment feedback and can be handled through targeted post-processing or additional refinement.

5. Conclusions

In this study, we developed a context-conditioned DRL framework for the optimization of space frame structures, termed SFO-Agent. An interactive SFO-Environment was constructed by integrating parametric modeling with FEM. Through iterative training, SFO-Agent autonomously interacts with the FEM solver and learns the underlying relationships between design variables and structural performance, thereby progressively improving its design capability. Experimental results demonstrate the feasibility of SFO-Agent for optimizing space frame structures and suggest strong potential for practical engineering applications. The detailed conclusions are as follows.

The SFO-Agent was established by coupling an actor network with critics and target critics. Parametric modeling, FEM, and a code-compliant reward function were then developed and embedded into the SFO-Environment. Off-policy training within a context-conditioned framework enabled continuous interaction between SFO-Agent and SFO-Environment, allowing the agent to learn reusable policies across varying design cases for automated optimization of space frame structures.
A single-layer Kiewitt dome was selected as the benchmark, and the 600-episode off-policy training results demonstrated the progressive improvement of the agent. The analysis of a representative design case verified the rationality of the learned strategy. In addition, tests on 100 randomly generated cases and full-domain sweeps across context parameters showed robust performance over varying design cases.
Compared with the GA, the context-conditioned DRL agent learned reusable design patterns during training and therefore required substantially less time for optimization during the inference stage. The maximum-entropy formulation enhanced action exploration, leading to improved performance on more challenging models. Moreover, the proposed method was extended to several additional forms of space frame structures, further supporting its effectiveness.
In this study, the SFO-Agent was evaluated on common forms of space frame structures to demonstrate its effectiveness, while its applicability to a broader range of structural types still requires further extension, particularly for non-standard configurations. In addition, the current generalization capability mainly refers to adaptation to different design contexts within the same parametric structural type, rather than direct topological generalization across different structural systems. The test results also indicate that, under certain design conditions, the agent may produce low-reward solutions, suggesting that the robustness and overall optimization performance of the proposed framework should be further improved.

Author Contributions

Conceptualization, Y.L. and C.X.; methodology, Y.L.; software, Y.L.; validation, Y.L., C.X., F.F. and X.Z.; formal analysis, Y.L.; investigation, X.Z.; resources, C.X.; data curation, C.X.; writing—original draft preparation, Y.L.; writing—review and editing, C.X. and F.F.; visualization, Y.L.; supervision, X.Z.; project administration, C.X.; funding acquisition, C.X. All authors have read and agreed to the published version of the manuscript.

Funding

The work described in this paper was supported by the National Key Research and Development Program of China (Grant No. 2023YFC3805005), and Research Foundation of China Academy of Building Research (Grant No. 2025-012C021507-001).

Data Availability Statement

Data will be available on request.

Conflicts of Interest

Authors Yinbin Li and Congzhen Xiao were employed by the company China Academy of Building Research. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

Algorithm A1 Training procedure of the SFO-Agent
Given: episodes $N$, max steps $T$, batch size $B$, replay buffer size $N_{RB}$, minimal size $N_{\min}$, discount factor $\gamma$, soft update factor $\tau$, learning rates $lr_{\pi}$, $lr_{Q}$, $lr_{\alpha}$, target entropy $H_{0}$
Initialize: actor $\pi_{\theta}$, critics $Q_{\omega_{1}}$, $Q_{\omega_{2}}$, target critics $q_{\omega_{1}'}$, $q_{\omega_{2}'}$, replay buffer RB
for episode = 1, …, N do
    Generate a random context $c$
    for t = 1, …, T do
        $a_{t} \sim \pi_{\theta}(\cdot \mid s_{t})$
        SFO-Environment provides $r_{t}$, $s_{t+1}$ based on $s_{t}$, $a_{t}$
        Save the experience $(s_{t}, a_{t}, r_{t}, s_{t+1})$ in RB
        $s_{t} \leftarrow s_{t+1}$
        $N_{RB} \leftarrow N_{RB} + 1$
        if $N_{RB} > N_{\min}$ then
            $(S_{t}, A_{t}, R_{t}, S_{t+1}), W_{IS} \sim$ RB  ➢ Sample a batch of size B from RB
            $A_{t+1} \sim \pi_{\theta}(\cdot \mid S_{t+1})$
            $TD_{\mathrm{target}} = R_{t} + \gamma \left( \min \left( q_{\omega_{1}'}(S_{t+1}, A_{t+1}), q_{\omega_{2}'}(S_{t+1}, A_{t+1}) \right) - \alpha \log \pi_{\theta}(A_{t+1} \mid S_{t+1}) \right)$  ➢ Computation of the Q target
            $L(Q_{1}) = \sum \left( W_{IS} \left( Q_{\omega_{1}}(S_{t}, A_{t}) - TD_{\mathrm{target}} \right)^{2} \right) / \sum W_{IS}$  ➢ Loss of critic 1
            $L(Q_{2}) = \sum \left( W_{IS} \left( Q_{\omega_{2}}(S_{t}, A_{t}) - TD_{\mathrm{target}} \right)^{2} \right) / \sum W_{IS}$  ➢ Loss of critic 2
            $\omega_{1} \leftarrow \omega_{1} - lr_{Q} \nabla_{\omega_{1}} L(Q_{1})$  ➢ Update of critic 1
            $\omega_{2} \leftarrow \omega_{2} - lr_{Q} \nabla_{\omega_{2}} L(Q_{2})$  ➢ Update of critic 2
            $A_{t}' \sim \pi_{\theta}(\cdot \mid S_{t})$
            $L(\pi) = \alpha \log \pi_{\theta}(A_{t}' \mid S_{t}) - \min \left( Q_{\omega_{1}}(S_{t}, A_{t}'), Q_{\omega_{2}}(S_{t}, A_{t}') \right)$  ➢ Loss of the actor
            $\theta \leftarrow \theta - lr_{\pi} \nabla_{\theta} L(\pi)$  ➢ Update of the actor
            $\omega_{1}' \leftarrow (1 - \tau) \omega_{1}' + \tau \omega_{1}$, $\omega_{2}' \leftarrow (1 - \tau) \omega_{2}' + \tau \omega_{2}$  ➢ Soft update of target critics
            $L(\alpha) = - \alpha \log \pi_{\theta}(A_{t}' \mid S_{t}) - \alpha H_{0}$  ➢ Loss of α
            $\alpha \leftarrow \alpha - lr_{\alpha} \nabla_{\alpha} L(\alpha)$  ➢ Update of α
            Update PER priorities based on TD error
        end if
    end for
end for
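The scalar quantities in Algorithm A1 can be written as plain functions to make the update rules concrete. This is a sketch on single samples and small lists, with no neural networks or automatic differentiation; the function names are hypothetical, and the defaults for γ, τ, and the target entropy are the values given in Section 3.1.

```python
def td_target(r, q1_next, q2_next, logp_next, alpha, gamma=0.99):
    """Soft Q target: reward plus discounted, entropy-adjusted minimum target Q."""
    return r + gamma * (min(q1_next, q2_next) - alpha * logp_next)


def weighted_critic_loss(q_values, targets, weights):
    """Importance-weighted squared TD error, normalized by the weight sum."""
    numerator = sum(w * (q - y) ** 2
                    for q, y, w in zip(q_values, targets, weights))
    return numerator / sum(weights)


def actor_loss(logp, q1, q2, alpha):
    """Entropy-regularized policy loss for one sampled action."""
    return alpha * logp - min(q1, q2)


def temperature_loss(alpha, logp, target_entropy=-5.0):
    """Loss used to adapt the entropy coefficient alpha."""
    return -alpha * logp - alpha * target_entropy


def soft_update(target_params, online_params, tau=0.003):
    """Polyak averaging of target-critic parameters."""
    return [(1 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]


# Small hand-checkable example with made-up numbers.
y = td_target(r=1.0, q1_next=2.0, q2_next=3.0, logp_next=-1.0,
              alpha=0.5, gamma=0.5)
loss = weighted_critic_loss([1.0, 3.0], [0.0, 1.0], [1.0, 3.0])
new_target = soft_update([0.0], [1.0], tau=0.25)
print(y, loss, new_target)
```

Taking the minimum of the two target critics limits overestimation of the Q value, while the small τ makes the target critics track the online critics slowly.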

References

1. Narayanan, S. Space Structures: Principles and Practice; Multi-Science Publishing Co., Ltd.: Brentwood, CA, USA, 2006; ISBN 978-0-906522-42-4.
2. Dong, S.L. Analysis Design and Construction of New Space Structures; People’s Communications Publishing House Co., Ltd.: Beijing, China, 2006.
3. Aldwaik, M.; Adeli, H. Advances in Optimization of Highrise Building Structures. Struct. Multidiscip. Optim. 2014, 50, 899–919.
4. Mei, L.; Wang, Q. Structural Optimization in Civil Engineering: A Literature Review. Buildings 2021, 11, 66.
5. Erbatur, F.; Hasançebi, O.; Tütüncü, İ.; Kılıç, H. Optimal Design of Planar and Space Structures with Genetic Algorithms. Comput. Struct. 2000, 75, 209–224.
6. Delyová, I.; Frankovský, P.; Bocko, J.; Trebuňa, P.; Živčák, J.; Schürger, B.; Janigová, S. Sizing and Topology Optimization of Trusses Using Genetic Algorithm. Materials 2021, 14, 715.
7. Qin, L.; Huang, W.; Du, Y.; Zheng, L.; Jawed, M.K. Genetic Algorithm-Based Inverse Design of Elastic Gridshells. Struct. Multidiscip. Optim. 2020, 62, 2691–2707.
8. Tomei, V.; Grande, E.; Imbimbo, M. Design Optimization of Gridshells Equipped with Pre-Tensioned Rods. J. Build. Eng. 2022, 52, 104407.
9. Lamberti, L. An Efficient Simulated Annealing Algorithm for Design Optimization of Truss Structures. Comput. Struct. 2008, 86, 1936–1953.
10. Li, L.J.; Huang, Z.B.; Liu, F.; Wu, Q.H. A Heuristic Particle Swarm Optimizer for Optimization of Pin Connected Structures. Comput. Struct. 2007, 85, 340–349.
11. Luh, G.C.; Lin, C.Y. Optimal Design of Truss-Structures Using Particle Swarm Optimization. Comput. Struct. 2011, 89, 2221–2232.
12. Tsiptsis, I.N.; Liimatainen, L.; Kotnik, T.; Niiranen, J. Structural Optimization Employing Isogeometric Tools in Particle Swarm Optimizer. J. Build. Eng. 2019, 24, 100761.
13. Xiao, F.; Mao, Y.; Tian, G.; Chen, G.S. Partial-Model-Based Damage Identification of Long-Span Steel Truss Bridge Based on Stiffness Separation Method. Struct. Control Health Monit. 2024, 2024, 5530300.
14. Mao, Y.; Xiao, F.; Tian, G.; Xiang, Y. Sensitivity Analysis and Sensor Placement for Damage Identification of Steel Truss Bridge. Structures 2025, 73, 108310.
15. Charalampakis, A.E.; Papanikolaou, V.K. Machine Learning Design of R/C Columns. Eng. Struct. 2021, 226, 111412.
16. Cheng, J.; Li, X.; Jiang, K.; Li, S.; Su, A.; Zhao, O. Machine-Learning-Assisted Design of High Strength Steel I-Section Columns. Eng. Struct. 2024, 308, 118018.
17. Huang, X.; Jiang, K.; Zhao, O. Unified Machine-Learning-Aided Design of Cold-Formed Steel Channel Section Columns with Different Buckling Modes at Ambient and Elevated Temperatures. Eng. Struct. 2024, 320, 118875.
18. Marie, H.S.; Abu el-hassan, K.; Almetwally, E.M.; El-Mandouh, M.A. Joint Shear Strength Prediction of Beam-Column Connections Using Machine Learning via Experimental Results. Case Stud. Constr. Mater. 2022, 17, e01463.
19. Wang, S.; Xu, J.; Wang, Y.; Pan, C. Machine Learning-Based Prediction of Shear Strength of Steel Reinforced Concrete Columns Subjected to Axial Compressive Load and Seismic Lateral Load. Structures 2023, 56, 104968.
20. de Lautour, O.R.; Omenzetter, P. Prediction of Seismic-Induced Structural Damage Using Artificial Neural Networks. Eng. Struct. 2009, 31, 600–606.
21. Asgarkhani, N.; Kazemi, F.; Jankowski, R. Machine Learning-Based Prediction of Residual Drift and Seismic Risk Assessment of Steel Moment-Resisting Frames Considering Soil-Structure Interaction. Comput. Struct. 2023, 289, 107181.
22. Chang, K.-H.; Cheng, C.-Y. Learning to Simulate and Design for Structural Engineering. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 1426–1436.
23. Song, L.; Wang, C.; Fan, J.; Lu, H. Elastic Structural Analysis Based on Graph Neural Network without Labeled Data. Comput.-Aided Civ. Infrastruct. Eng. 2023, 38, 1307–1323.
24. Zhao, P.; Liao, W.; Huang, Y.; Lu, X. Intelligent Beam Layout Design for Frame Structure Based on Graph Neural Networks. J. Build. Eng. 2023, 63, 105499.
25. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Comput. 1989, 1, 541–551.
26. Pizarro, P.N.; Massone, L.M.; Rojas, F.R.; Ruiz, R.O. Use of Convolutional Networks in the Conceptual Structural Design of Shear Wall Buildings Layout. Eng. Struct. 2021, 239, 112311.
27. Huang, W.X.; Zheng, H. Architectural Drawings Recognition and Generation through Machine Learning. In Proceedings of the Recalibration: On Imprecision and Infidelity: Proceedings of the 38th Annual Conference of the Association for Computer Aided Design in Architecture; Association for Computer Aided Design in Architecture (ACADIA): Fargo, ND, USA, 2018; pp. 156–165.
28. Zheng, H.; An, K.; Wei, J.X.; Ren, Y. Apartment Floor Plans Generation via Generative Adversarial Networks. In Proceedings of the RE: Anthropocene, Design in the Age of Humans: Proceedings of the 25th International Conference on Computer-Aided Architectural Design Research in Asia (CAADRIA 2020); The Association for Computer-Aided Architectural Design Research in Asia (CAADRIA): Fargo, ND, USA, 2020; pp. 601–610.
29. Liao, W.J.; Lu, X.Z.; Huang, Y.; Zheng, Z.; Lin, Y. Automated Structural Design of Shear Wall Residential Buildings Using Generative Adversarial Networks. Autom. Constr. 2021, 132, 103931.
30. Minsky, M. Steps toward Artificial Intelligence. Proc. IRE 1961, 49, 8–30.
31. Sutton, R.S. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. In Machine Learning Proceedings 1990; Morgan Kaufmann: Burlington, MA, USA, 1990; pp. 216–224.
32. Ha, D.; Schmidhuber, J. World Models. arXiv 2018, arXiv:1803.10122.
33. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-Level Control through Deep Reinforcement Learning. Nature 2015, 518, 529–533.
34. Wang, Z.Y.; Schaul, T.; Hessel, M.; van Hasselt, H.; Lanctot, M.; de Freitas, N. Dueling Network Architectures for Deep Reinforcement Learning. In International Conference on Machine Learning; PMLR: New York, NY, USA, 2016.
35. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 1889–1897.
36. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
37. Konda, V.; Tsitsiklis, J. Actor-Critic Algorithms. In Proceedings of the Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1999; Volume 12.
38. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft Actor-Critic Algorithms and Applications. arXiv 2019, arXiv:1812.05905.
39. Hayashi, K.; Ohsaki, M. Reinforcement Learning and Graph Embedding for Binary Truss Topology Optimization Under Stress and Displacement Constraints. Front. Built Environ. 2020, 6, 59.
40. Zhu, S.J.; Ohsaki, M.; Hayashi, K.; Guo, X.N. Machine-Specified Ground Structures for Topology Optimization of Binary Trusses Using Graph Embedding Policy Network. Adv. Eng. Softw. 2021, 159, 103032.
41. Jeong, J.; Jo, H. Deep Reinforcement Learning for Automated Design of Reinforced Concrete Structures. Comput.-Aided Civ. Infrastruct. Eng. 2021, 36, 1508–1529.
42. Fu, B.C.; Gao, Y.Q.; Wang, W. A Physics-informed Deep Reinforcement Learning Framework for Autonomous Steel Frame Structure Design. Comput.-Aided Civ. Infrastruct. Eng. 2024, 39, 3125–3144.
43. Du, M.; Gao, Y.; Wang, W.; Fu, B. FrameGym: A Reinforcement Learning Environments for Steel Frame Structures. Eng. Struct. 2025, 343, 120991.
44. Hallak, A.; Di Castro, D.; Mannor, S. Contextual Markov Decision Processes. arXiv 2015, arXiv:1502.02259.
45. Schaul, T.; Horgan, D.; Gregor, K.; Silver, D. Universal Value Function Approximators. In Proceedings of the 32nd International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 1312–1320.
46. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym. arXiv 2016, arXiv:1606.01540.
47. Holzer, D.; Hough, R.; Burry, M. Parametric Design and Structural Optimisation for Early Design Exploration. Int. J. Archit. Comput. 2007, 5, 625–643.
48. JGJ 7-2010; Technical Specification for Space Frame Structure. China Architecture & Building Press: Beijing, China, 2010.
49. Precup, D.; Sutton, R.S.; Singh, S.P. Eligibility Traces for Off-Policy Policy Evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2000; pp. 759–766.
50. Gythiel, W.; Mommeyer, C.; Raymaekers, T.; Schevenels, M. A Comparative Study of the Structural Performance of Different Types of Reticulated Dome Subjected to Distributed Loads. Front. Built Environ. 2020, 6, 56.
51. GB/T 17395-2024; Dimensions, Shapes, Masses and Tolerances of Steel Tubes. China Iron and Steel Association: Beijing, China, 2024.

Figure 1. SFO-Agent. An automated optimal design framework for space frame structures.

Figure 2. SFO-Environment based on parametric modeling and FEM analysis.

Figure 3. Context-conditioned state representation across different design cases.

Figure 4. Schematic of the environment state transition.

Figure 5. SFO-Agent architecture and data flow.

Figure 6. Schematic of the Kiewitt dome structure.

Figure 7. Average reward during 600 episodes.

Figure 8. Training stability analysis of the SFO-Agent over five random seeds. (a). Reward curves of five independent random-seed runs. (b). Average reward curve and ±1 standard deviation over five random seeds.

Figure 9. Model of typical case.

Figure 10. Results of the step with the highest reward.

Figure 11. Rewards of 100 random design cases.

Figure 12. Agent performance across different context dimensions. (a). Reward as a function of structural diameter. (b). Reward as a function of rise-to-span ratio. (c). Reward as a function of additional dead load. (d). Reward as a function of live load.

Figure 13. Comparison of training reward curves between SAC and PPO under the same FEM interaction budget.

Figure 14. Structural layouts of SL-LCRS, SL-HPRS, and SL-SSRS.

Figure 15. Average reward during training and results on 100 random test cases. (a,b). Training and test results for SL-LCRS. (c,d). Training and test results for SL-HPRS. (e,f). Training and test results for SL-SSRS.

Table 1. Definitions of the parameters in the state.

No.	Parameter	Definition	Unit
1	D	Spherical shell span diameter	mm
2	$R_{H D}$	Rise-to-span ratio	-
3	DL	Dead load	N/mm²
4	LL	Live load	N/mm²
5	$l_{1}$	Radial grid size	mm
6	$R_{l 1}$	Ratio of circumferential to radial grid size	-
7	$S_{1}$	Section index of radial members	-
8	$S_{2}$	Section index of circumferential members	-
9	$S_{3}$	Section index of diagonal members	-

Table 2. FEM output metrics used in reward function.

No.	Parameter	Definition	Unit
1	M	Material usage	t
2	UZ	Maximum deflection	mm
3	BF	First-order buckling factor	-
4	$R_{1}$	Maximum stress ratio of member group $S_{1}$	-
5	$R_{2}$	Maximum stress ratio of member group $S_{2}$	-
6	$R_{3}$	Maximum stress ratio of member group $S_{3}$	-

Table 3. Ranges of the state parameters.

No.	Parameter	Unit	Min	Max	Resolution	Description
1	D	mm	20,000	80,000	1000	Spherical shell span diameter
2	$R_{H D}$	-	1/10	1/4	0.01	Rise-to-span ratio
3	DL	N/mm²	0.0003	0.0015	0.0001	Dead load
4	LL	N/mm²	0.0005	0.0015	0.0001	Live load
5	$l_{1}$	mm	2000	6000	100	Radial grid size
6	$R_{l 1}$	-	0.5	1.5	0.1	Ratio of circumferential to radial grid size
7	$S_{1}$	-	0	74	1	Section index of radial members
8	$S_{2}$	-	0	74	1	Section index of circumferential members
9	$S_{3}$	-	0	74	1	Section index of diagonal members

Table 4. Network configuration.

Network	Hidden Layer	Hidden Layer Dimension	Activation Function	Learning Rate
Actor	1	512	ReLU	$1 \times 10^{- 4}$
Actor	2	512	Tanh	$1 \times 10^{- 4}$
Critic 1	1	512	ReLU	$3 \times 10^{- 4}$
Critic 1	2	512	Tanh	$3 \times 10^{- 4}$
Critic 2	1	512	ReLU	$3 \times 10^{- 4}$
Critic 2	2	512	Tanh	$3 \times 10^{- 4}$
Target critic 1	1	512	ReLU	-
Target critic 1	2	512	Tanh	-
Target critic 2	1	512	ReLU	-
Target critic 2	2	512	Tanh	-

Table 5. The 10-step design procedure for typical case.

Step	$l_{1}$ (mm)	$R_{l 1}$	$S_{1}$	$S_{2}$	$S_{3}$	Reward	Compliance	M (t)
1	3600	1.2	37	20	33	9.30	Pass	59.93
2	4400	1.2	56	33	32	10.20	Pass	54.96
3	3600	1.4	46	24	29	10.57	Pass	52.93
4	4400	1.4	51	29	28	12.44	Pass	42.67
5	3900	1.5	52	27	27	10.03	Pass	55.92
6	4200	1.5	49	29	26	10.28	Pass	54.55
7	4100	1.4	52	27	26	9.43	Pass	59.22
8	4200	1.4	52	27	26	8.95	Pass	61.83
9	4000	1.5	49	26	28	10.65	Pass	52.51
10	4300	1.5	52	29	27	10.08	Pass	56.64
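The rows of Table 5 can be read as one episode of candidate designs. As a small sketch, assuming the final design is taken as the code-compliant step with the highest reward (a selection rule assumed here, not stated in the table), the choice can be reproduced as follows. The tuple layout and the name `best` are illustrative.

```python
# (step, l1 in mm, R_l1, S1, S2, S3, reward, compliance, M in t), from Table 5
STEPS = [
    (1, 3600, 1.2, 37, 20, 33, 9.30, "Pass", 59.93),
    (2, 4400, 1.2, 56, 33, 32, 10.20, "Pass", 54.96),
    (3, 3600, 1.4, 46, 24, 29, 10.57, "Pass", 52.93),
    (4, 4400, 1.4, 51, 29, 28, 12.44, "Pass", 42.67),
    (5, 3900, 1.5, 52, 27, 27, 10.03, "Pass", 55.92),
    (6, 4200, 1.5, 49, 29, 26, 10.28, "Pass", 54.55),
    (7, 4100, 1.4, 52, 27, 26, 9.43, "Pass", 59.22),
    (8, 4200, 1.4, 52, 27, 26, 8.95, "Pass", 61.83),
    (9, 4000, 1.5, 49, 26, 28, 10.65, "Pass", 52.51),
    (10, 4300, 1.5, 52, 29, 27, 10.08, "Pass", 56.64),
]

# Keep only code-compliant steps, then take the one with the highest reward.
compliant = [row for row in STEPS if row[7] == "Pass"]
best = max(compliant, key=lambda row: row[6])

print(f"best step: {best[0]}, reward: {best[6]}, M: {best[8]} t")
```

Under this rule the selected design is step 4, which is also the lightest of the ten steps and matches the SFO-Agent reward and material usage listed for Case 2 in Table 7.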

Table 6. Design parameters of the comparative cases.

Case for Comparison	D (mm)	$R_{H D}$	DL (N/mm²)	LL (N/mm²)
Case 1	25,000	1/8	0.0014	0.001
Case 2	50,000	1/7	0.0015	0.0005
Case 3	70,000	1/6	0.0008	0.0005

Table 7. Results of algorithm comparison.

Case for Comparison	GA Time (s)	GA Reward	GA M (t)	SFO-Agent Time (s)	SFO-Agent Reward	SFO-Agent M (t)
Case 1	2036	14.54	10.67	57	13.60	12.44
Case 2	3296	12.29	43.48	102	12.44	42.67
Case 3	5862	5.86	86.06	137	7.90	73.82
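The comparison in Table 7 is easier to read as ratios. The short sketch below derives, from the tabulated values only, the runtime ratio of GA to SFO-Agent and the relative difference in material usage for each case. The variable names are illustrative, and no claim is made here about what the reported times include.

```python
# case: (GA time s, GA reward, GA M t, agent time s, agent reward, agent M t)
TABLE_7 = {
    "Case 1": (2036, 14.54, 10.67, 57, 13.60, 12.44),
    "Case 2": (3296, 12.29, 43.48, 102, 12.44, 42.67),
    "Case 3": (5862, 5.86, 86.06, 137, 7.90, 73.82),
}

summary = {}
for case, (t_ga, r_ga, m_ga, t_ag, r_ag, m_ag) in TABLE_7.items():
    speedup = t_ga / t_ag                     # how many times faster the agent is
    mass_diff = (m_ag - m_ga) / m_ga * 100.0  # negative means the agent is lighter
    summary[case] = (speedup, mass_diff, r_ag - r_ga)
    print(f"{case}: speedup {speedup:.1f}x, "
          f"mass difference {mass_diff:+.1f}%, "
          f"reward difference {r_ag - r_ga:+.2f}")
```

Read this way, the agent is roughly 32 to 43 times faster in all three cases. GA finds the lighter design in Case 1, while the agent's design is lighter in Cases 2 and 3.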


