MacHa: Multi-Aspect Controllable Text Generation Based on a Hamiltonian System

Delong Xu; Min Lin; Yurong Wang

doi:10.3390/computers14120503

,

and

¹

College of Computer Science and Technology, Inner Mongolia Normal University, No. 81, Zhao Wuda Road, Saihan District, Hohhot 010022, China

²

Infinite-Dimensional a Hamiltonian System and Algorithms Application Key Laboratory of the Ministry of Education, Inner Mongolia Applied Mathematics Center, No. 81, Zhao Wuda Road, Saihan District, Hohhot 010022, China

^*

Author to whom correspondence should be addressed.

Computers2025, 14(12), 503;https://doi.org/10.3390/computers14120503

This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling (2nd Edition)

Version Notes

Order Reprints

Abstract

Multi-faceted controllable text generation can be viewed as an extension and combination of controllable text generation tasks. It requires the generation of fluent text while controlling multiple different attributes (e.g., negative emotions and environmental protection in themes). Current research either estimates compact latent spaces for multiple attributes, reducing interference between different attributes but making it difficult to control the balance between multiple attributes, or controls the balance between multiple attributes but requires complex searches for decoding. Based on these issues, we propose a new method called MacHa, which trains an attribute latent space using multiple loss functions and establishes a mapping between the attribute latent space and attributes in sentences using a VAE network. An energy model based on the Hamilton function is defined in the potential space to control the balance between multiple attributes. Subsequently, in order to reduce the complexity of the decoding process, we extract samples using the RL sampling method and send them to the VAE decoder to generate the final text. The experimental results show that the MacHa method generates text with higher accuracy than the baseline models after balancing multiple attributes and has a fast decoding speed.

Keywords:

multi-faceted controllable text generation; Hamilton system; large language models; RL sampling

1. Introduction

Controllable text generation (CTG) refers to steering the semantics and style of generated text by incorporating various constraints [1,2]. However, it typically focuses on a specific aspect or controls only a single dimension of the generation process.Multi-aspect controllable text generation (MCTG) can be regarded as an extension and combination of controllable text generation tasks, requiring the simultaneous control of multiple distinct attributes or conditions [3,4]. There is now a growing demand for controlling more attributes—for instance, an intelligent writing assistant may need to simultaneously regulate style, topic, and structure. However, because text corpora with multi-attribute annotations are relatively scarce, directly training a large model that masters all these attributes is difficult. Consequently, recent research has increasingly focused on the task of multi-aspect controllable text generation.

Early work on MCTG relied on handcrafted rules and templates to generate text [5,6]. These approaches, however, suffer from obvious drawbacks: they depend heavily on expert knowledge and exhibit poor domain transferability, making it hard to cope with complex linguistic phenomena and diverse generation demands. Current methods for multi-aspect controllable text generation can be broadly divided into two categories. The first consists of training-stage-based methods, which integrate control conditions during model training so that the model learns to produce desired text. They either guide the model to produce desired text by devising special input [7,8] or steer generation by modifying hidden-layer activations or manipulating latent vectors [9,10]. Yet, during training it is hard for such methods to reconcile the discrepancies among different attributes, often resulting in poor text quality. The second category comprises inference-stage-based methods [11], which intervene in and steer the generation process while the model is producing text, thereby achieving controllable text generation. They either intervene in the generation process in real time—altering the output probability distribution or applying specific rules to bias token selection [12,13]—or, operating within an encoder–decoder framework, employ parameter optimization techniques to maximize the likelihood of all samples in the training set. Influential factors are encoded into multiple latent subspaces, from which different samples are drawn to generate diverse texts, thereby achieving controllable generation [14,15]. However, such approaches require complex iterative search during decoding, resulting in slow inference. Therefore, an ideal method for multi-aspect controllable text generation should simultaneously improve text quality while maintaining high inference efficiency.

Current research on multi-aspect controllable text generation not only faces challenges in balancing attributes and decoding efficiency, but also exhibits significant shortcomings in interpretability. Existing methods largely fail to integrate the explainability of generation mechanisms with complex system modeling, making it difficult to clearly elucidate decision-making logic under multi-attribute constraints. Choi and Kim [16] offer crucial insights into this issue by proposing a method that integrates graph embedding with SHAP-based explainable artificial intelligence (xAI). This approach successfully elucidates the interdependencies within the complex system of financial markets. Their work demonstrates that by rationally merging representation learning with interpretable frameworks, the decision-making processes of complex systems can be made transparent.

This paper introduces MacHa, a novel technique for multi-aspect controllable text generation. The method learns attribute-specific latent spaces via multiple loss functions, uses a variational autoencoder (VAE) to map these spaces to sentence attributes, and applies reinforcement learning (RL) sampling to extract samples. To reconcile conflicting attribute demands, we embed an energy-based Hamiltonian model in the latent space: kinetic energy governs fluency and naturalness, while potential energy steers topic and sentiment. To avoid costly decoding search and accelerate inference, our RL sampler dynamically adjusts its policy during generation, greatly improving sampling efficiency. Consistent with Choi and Kim’s interpretable modeling approach for complex systems, MacHa constructs an explainable decision system through Hamiltonian representation: kinetic and potential energy constraints in the Hamiltonian correspond to different objective dimensions of multi-faceted text generation. Its energy conservation property ensures interpretability in multi-attribute control, while the balancing mechanism between text quality and attribute accuracy during production can be quantitatively explained through the interaction between kinetic and potential energy.

We evaluate our method on multi-aspect controllable text-generation tasks spanning eight categories drawn from the IMDb movie review dataset (two sentiment attributes) [17] and the AGNews dataset (four sentiment attributes) [18]. Compared with state-of-the-art baselines, our approach achieves significant improvements in both generation quality and inference speed.

2. Related Work

2.1. Multi-Faceted Controllable Text Generation

Based on the training phase, the approaches can be specifically divided into supervised fine-tuning and retraining. Supervised fine-tuning adjusts the parameters of a pre-trained model using a specific dataset, aiming to integrate predetermined control attributes into the model architecture and to align the model with the requirements of a particular task. The Plug and Play with Prompts method [19] introduces a prompt-based controllable text generation approach that optimizes continuous prompts to steer the model toward producing text satisfying specific control constraints, all without substantial modifications to the model parameters. The Token-level Feedback RL method [20] presents a hybrid approach combining reinforcement learning with supervised fine-tuning. By incorporating token-level feedback signals during generation, it guides the model to create text aligned with the desired control conditions, thereby enhancing both quality and relevance. Retraining refers to training a model from scratch on a task-specific dataset, ensuring that the resulting model fully satisfies the required control conditions. The CBARTmethod [21] introduces a model called Constrained BART (CBART) for incorporating specified keywords (lexical constraints) into text generation. By attaching a token-level classifier to the encoder, CBART shifts part of the generation burden from the decoder to the encoder, thereby improving the quality of the generated sentences. The FUDGE [12] method trains the model’s controllability over text generation by designing effective instructions and fine-tuning strategies. It also introduces an algorithm that automatically synthesizes constraint datasets using only the task dataset and a large language model with in-context learning capabilities, eliminating the need for manually curated constraint data.

Inference-stage approaches can be specifically categorized into decoding-time intervention and latent-space manipulation. During decoding, intervention reshapes the output’s probability distribution or enforces rules to steer token selection, ensuring that the output aligns with given control conditions. This is typically achieved by employing a classifier or reward model to evaluate generated partial sequences and performing real-time adjustments during the decoding process. The CAV method [22] introduces a technique for steering the generation of large language models by modifying activation vectors during decoding, termed Concept Activation Vector (CAV) control. By adjusting activation vectors linked to specific concepts at inference time, it achieves controllable text generation without retraining or fine-tuning the model. Latent-space manipulation, in turn, governs the generated text by altering activation states in the model’s hidden layers, enabling precise control over the generation process without changing the model’s parameters. Specifically, the MAGIC method [23] disentangles imbalanced attribute correlations during training by leveraging counterfactual feature vectors in the attribute latent space and employs a goal-directed counterfactual augmentation strategy during inference to strengthen attribute relevance, thereby refining multi-aspect control. The JAM method [24] employs a causal model to dissect the key factors underlying text generation and achieves precise control by manipulating latent-space vectors, all without modifying model weights. The MAGIC and JAM methods, along with the selected LatentOps and MacLaSa, belong to the “latent-space manipulation” sub-category and exhibit functional overlap. Therefore, there is no need to include them as additional baseline models to fully demonstrate the advanced nature of the proposed methods, thereby avoiding experimental focus dilution caused by redundant benchmarks.

2.2. Hamilton System

A Hamiltonian system is used to describe the motion of a conservative system—i.e., one in which energy is conserved—and the evolution of the system is specified by a Hamiltonian function H(q, p) that typically represents the total energy, comprising both kinetic and potential contributions. The state of the system is characterized jointly by the position q and the momentum p. In recent years, the development of a Hamiltonian system in artificial intelligence has shown deep, multi-dimensional innovation. Foremost, the concept of Hamiltonian Neural Networks (HNNs) [25] is introduced. Its core idea is to use a neural network to directly learn the Hamiltonian H of a system; in doing so, the network automatically captures the underlying physical conservation laws, such as energy conservation. Furthermore, the principles of HNNs have been leveraged to enhance traditional deep learning architectures—such as Recurrent Neural Networks (RNNs) [26] and Transformers [27]—by incorporating the energy conservation law inherent in a Hamiltonian system, thereby improving model performance on NLP tasks. Zhang et al. [28] propose the Hamiltonian Neural Koopman Operator (HNKO), a machine learning framework that automatically preserves and discovers conservation laws, thereby enhancing the accuracy of dynamical predictions from noisy observations.

The contributions of our proposed MacHa framework are as follows:

(1): We define an energy model based on the Hamilton function, placing it within the latent space and embedding multiple constraints into this energy field: kinetic energy governsthe fluency, naturalness, and diversity of generated text, while potential energy guides the target topic and emotional orientation, thereby coordinating the differences between various attributes during text generation.
(2): During the final sampling phase, we employ an RL-based sampling method that leverages a reinforcement learning mechanism to dynamically adjust the sampling policy within complex decision processes, effectively eliminating the need for intricate search during decoding.
(3): We evaluate the proposed MacHa framework on multi-aspect controllable text generation tasks and observe substantial improvements over current baselines, confirming the effectiveness of MacHa.

3. Method

In Section 3.1 we first formally define the task of multi-aspect controlled text generation. We then describe how to define a latent space (Section 3.2), how to formulate a Hamiltonian energy-based model within this latent space (Section 3.3), and how to generate the final output by sampling from the Hamiltonian energy-based model using an RL-based sampling procedure (Section 3.4). The flowchart of the entire method is shown in Figure 1.

Figure 1. MacHa method flowchart: First, map the text through the VAE encoder and optimize the latent space distribution. Then, construct a Hamiltonian energy model to control the multi-dimensional characteristics of text generation. Finally, generate the target text sequence using reinforcement learning sampling and the VAE decoder.

3.1. Definition of Multi-Aspect Controlled Text Generation

We formally define the multi-aspect controlled text generation task as follows: Assume we have N aspects represented by

T = {T_{1}, \dots, T_{n}}

, and for each aspect

T_{i}

there are n attributes denoted as

{T_{n}^{1}, \dots, T_{N}^{n}}

. The objective of our task is to generate targets that possess multiple attributes, denoted as

t = {t_{1}^{*}, \dots, T_{n}^{*}}

, For example, we might expect to generate sentences that exhibit

T_{1}^{2}

(stemming from aspect 1) and

T_{3}^{4}

(stemming from aspect 3).

3.2. Train a Latent Space

To construct a latent space, we adopt the MacLaSa method proposed by Ding et al. [29], which employs a VAE network equipped with a pre-trained language model. Use

z = {Encoder}_{\emptyset} (x)

to encode any unidirectional sentence x into its hidden representation z; the latent representation of the encoding constructs a latent space. To render the constructed latent space more compact and continuous, we train it with three constraint functions.

We adopt a loss function

L_{E}

, termed the reconstruction–regularization loss, which is designed to achieve two objectives. First, it preserves reconstruction quality, ensuring that the input data x can be accurately reconstructed; i.e., the model’s output generated by the decoder from the latent variable z closely matches the original input. Second, it regularizes the latent space, ensuring that the distribution of the latent space closely approximates a predefined prior, thereby preventing overfitting and promoting compactness and continuity of the latent representations. Its expression is as follows:

\begin{matrix} L_{E} & = {- E}_{q_{\emptyset} (z ∣ x)} \frac{{∥ x - Decoder (z) ∥}^{2}}{2 σ^{2}} \\ + \frac{1}{2} [{∥ μ (x) - μ_{prior} ∥}^{2} + ∥ Tr (\sum (x) + \sum_{prior} - \sum (x) \sum_{prior}^{- 1} \sum (x)) ∥] \end{matrix}

(1)

In Equation (1),

{- E}_{q_{\emptyset} (z ∣ x)}

denotes the expectation computed with respect to the distribution

q_{\emptyset} (z ∣ x)

over the latent variable z. The mean-squared-error (MSE) term

\frac{{∥ x - Decoder (z) ∥}^{2}}{2 σ^{2}}

measures the distance between the input and the reconstructed output. Here

σ^{2}

is the noise variance, used to scale the reconstruction error. By minimizing this term we force the decoder to map the latent representation z back to the original input x as accurately as possible, thereby ensuring that the generated data remain close to the original data. The term

∥ μ (x) - μ_{prior} ∥

is referred to as the mean-difference term; it measures the discrepancy between the mean of the latent distribution

μ (x)

produced by the encoder and the mean of the prior distribution over

μ_{prior}

. By minimizing this term, the mean of the latent distribution is pulled toward the mean of the prior distribution.

Tr (\sum (x) + \sum_{prior} - \sum (x) \sum_{prior}^{- 1} \sum (x))

is named the covariance discrepancy term; it is used to compare the covariance matrix

\sum (x)

of the latent distribution generated by the encoder with the covariance matrix

\sum_{prior}

of the prior distribution. The trace operator is used to compute this discrepancy. By minimizing this term, the mean of the latent distribution is drawn toward the mean of the prior distribution.

The second loss function we use is

L_{C}

, a classification loss. It enables the representation vector

z_{i}

to retain the original attribute information and helps the model distinguish different attribute representations within the same aspect. Its formulation is as follows:

L_{C} = - \sum_{n = 1}^{N} \sum_{j = 1}^{|T|} \sum_{i \in S_{j}^{n}} ln \frac{exp (z_{i} \cdot p_{π_{n}} ω_{t_{n}^{j}}^{n})}{\sum_{k = 1}^{t_{n}^{j}} exp (z_{i} \cdot p_{π_{n}} ω_{k}^{n})}

(2)

In Equation (2),

S_{n}^{j}

is an index set denoting the indices of the representation vectors belonging to the n-th sample of the j-th aspect, and

t_{n}^{j}

is the ground-truth attribute label of the n-th sample of aspect

T_{j}

.

p_{π_{n}} ω_{t_{n}^{j}}^{n}

is the weight vector in the classifier

p_{π_{n}}

. The natural logarithm function corresponding to the ground-truth attribute

t_{n}^{j}

represents the probability that the vector

z_{i}

belongs to the ground-truth attribute

t_{n}^{j}

. By minimizing this loss, we are in effect maximizing the log-probability of the correct attribute class, thereby encouraging the model to predict the true attribute category more strongly.

The third loss function we employ is

L_{D}

, an aspect-difference loss proposed by GU et al. [30] that aligns the distributions of different aspects, enabling the model to produce more consistent representations when handling data from various aspects. By reducing distribution discrepancy, the model can better generalize and handle data from different aspects, preventing performance degradation caused by excessive inter-aspect variation. The formulation is as follows:

L_{D} = \sum_{1 \leq n_{1} < n_{2} \leq N} {∥\sum_{i \in S_{n_{1}}} \frac{z_{i}}{| S_{n_{1}} |} - \sum_{j \in S_{n_{2}}} \frac{z_{i}}{| S_{n_{2}} |}∥}_{2},

(3)

Equation (3) sums the distances between the distribution centers of every aspect pair, thereby quantifying the overall distribution discrepancy across all aspects. Here,

S_{n}

denotes the set of samples belonging to the n-th aspect, and

| S_{n} |

is the size of this set (i.e., the number of samples). Each

z_{i}

is the representation vector of the i-th sample. The entire expression computes the squared Euclidean distance between every pair of distribution centers. This distance quantifies the degree of discrepancy between the distribution centers of every two aspects in the representation space. By minimizing this loss, the distribution centers of different aspects are drawn closer together, thereby reducing the inter-aspect distribution gap. Our overall loss function is

L = λ_{1} L_{E} + λ_{2} L_{C} + λ_{3} L_{D}

. We also update the parameters

θ

and

π_{n}

accordingly.

3.3. Hamiltonian Energy Model

In the previous section, after training a continuous and compact latent space of attributes using a loss function, we achieved a reduction in interference between different aspects. A key challenge we face is how to balance multiple attributes within the latent space. To address this, we propose a Hamiltonian energy model whose core lies in exploiting the energy-conserving and stability properties of a Hamiltonian system. First, we define the joint distribution of the latent vector

z_{i}

from the previous section and the desired attributes c as follows:

p (z, c) = \frac{p_{prior} (z) \cdot e^{- ε_{n} H}}{Z}

(4)

This formulation allows us to generate text with desired attributes by sampling from this joint distribution, where

p_{prior} (z)

is a Gaussian prior. By means of the Hamiltonian function H, we incorporate multiple sets of constraints into the latent space, thereby achieving balanced control over several attributes. We define the Hamiltonian function H in a Hamiltonian system as follows:

H = T (p) + V (q)

(5)

Here, T denotes the kinetic energy, which captures the dynamic variations and information flow during text generation. A higher kinetic energy signifies faster feature changes, yielding more diverse and fluent text. V represents the potential energy, tasked with steering the affective orientation and thematic style of the generated text. We define three constraints within the kinetic term to ensure the fluency, naturalness, and diversity of the generated text. The specific formulas are as follows:

T (p) = T_{s m o o t h} (p) + T_{f u t u r e} (p) + T_{n g r a m} (p)

(6)

In Equation (6),

T_{s m o o t h} (p)

denotes the soft-fluency constraint. Fluency is a common requirement for generated text, whereas soft-fluency is an approach that regularizes solution smoothness during optimization yet tolerates minor violations. Unlike hard constraints, it does not demand strict adherence to smoothness; instead, it incorporates a penalty term into the objective function that penalizes non-smooth portions, thereby balancing smoothness against other constraints throughout the optimization process. This constraint formulation offers a more flexible optimization framework, allowing the generated text to satisfy fluency requirements without overly sacrificing other attributes in the presence of multiple constraints. The corresponding equation is

T_{smooth} (p) = \sum_{v} Q (v_{t} | h_{t}) log P (v_{t} | c_{t})

(7)

Here,

h_{t}

denotes the hidden state at time step t, obtained by applying a nonlinear function f to the hidden state

h_{t - 1}

from the previous time step and the current input

x_{t}

. Let

c_{t}

denote the context vector, and let

Q (v_{t} | h_{t})

be the predictive distribution over the vocabulary

v_{t}

given the hidden state

h_{t}

; this distribution is used for text generation.

l o g P (v_{t} | c_{t})

denotes the log-probability of a token

v_{t}

under the true distribution given the context vector

c_{t}

; this true distribution is used to assess the quality of the generated text. The entire expression means the following: for every token v we compute the product of its weight

Q (v_{t} | h_{t})

under the predicted distribution Q and its log-probability

l o g P (v_{t} | c_{t})

under the true distribution p and then sum over all tokens. We aim to adjust the parameters of the predictive distribution so that it becomes as close as possible to the true distribution.

In Equation (6),

T_{f u t u r e} (p)

denotes the future-word prediction constraint, which restricts or guides the model’s selection of upcoming tokens during generation, ensuring the produced text is more expected and coherent and meets specific requirements. The future-word constraint can operate on lexical, syntactic, or semantic levels. By restricting the set of candidate tokens that may appear next, it filters out implausible or undesirable outputs and thereby improves text quality. Its mathematical formulation is

T_{future} (p) = log P (v_{t + k} | h_{t}, f_{t})

(8)

Here,

h_{t}

denotes the hidden state at time step t, and

f_{t}

represents certain features at time step t.

P (v_{t + k} | h_{t}, f_{t})

is a conditional probability denoting the likelihood that the observation

v_{t + k}

occurs at time step

t + k

, given the hidden state

h_{t}

and features

f_{t}

at time step t. k represents the prediction step size, i.e., the state we want to predict k steps ahead from the current time step t.

In Equation (6),

T_{future} (p)

is the n-gram similarity constraint, which aims to limit the similarity between the generated text and the reference text at the n-gram level. It calculates the n-gram similarity between the generated text and the reference text (such as Jaccard similarity, cosine similarity, etc.) and controls it within a certain threshold range to avoid generating overly repetitive or highly similar text. The formula is expressed as follows:

T_{ngram} (p) = 1 - s i m i l a r i t y (y, y^{'})

(9)

Let

s i m i l a r i t y (y, y^{'})

be a similarity function that measures the similarity between the generated text sequence y and the reference text sequence

y^{'}

. At its core, it calculates the n-gram overlap between the generated sequence and the reference sequence using BLEU scores. After applying precision calculations, short sentence penalties, and normalization, it yields a similarity value within the range [0, 1]. We compute their dissimilarity by subtracting this similarity score from 1.

In a Hamiltonian system, the potential energy

V (q)

is employed to regulate multiple attributes. It converts the classifier’s logits into probabilities via the softmax function and balances the strengths of sentiment and topic control through a logarithmic transformation and learnable weight parameters. Within the Hamiltonian framework, this potential guides the evolution of the latent vector z, steering the generated text toward the desired sentiment and topic style. We define the potential

V (q)

as follows:

V (q) = - α \cdot log (\frac{e^{f_{sentiment} (z)}}{\sum_{c \in C} e^{f_{sentiment}^{c} (z)}}) - β \cdot log (\frac{e^{f_{topic} (z)}}{\sum_{t \in T} e^{f_{topic}^{c} (z)}})

(10)

The term

- α \cdot log (\frac{e^{f_{sentiment} (z)}}{\sum_{c \in C} e^{f_{sentiment}^{c} (z)}})

represents the sentiment-related component of the potential energy, where

α

is a weighting parameter that governs the strength of the sentiment constraint. The negative sign combined with the logarithmic function ensures that as the probability of the target sentiment increases, the potential

V (q)

decreases. This encourages the generated text to align more closely with the desired sentiment. The second term represents the topic-related component of the potential energy, with

β

as the weighting parameter that modulates the strength of the topic constraint. Likewise, the negative sign coupled with the logarithmic function causes the potential

V (q)

to decrease as the probability of the target topic rises, thereby steering the generated text toward the desired topic. Using the approach described in this section, we obtain a joint distribution

p (z, c)

that satisfies the desired balance among multiple attributes. When our task involves additional attributes, we need only introduce corresponding constraint terms into the

V (q)

formula to achieve unified potential energy constraints across multiple attributes. As the number of attributes increases, conflicts between constraints across multiple attributes become more pronounced. Leveraging the principle of “energy conservation,” Hamilton’s formula can extend the constraint terms corresponding to new attributes through the potential energy

V (q)

to balance conflicts among multiple attributes.

3.4. RL-Based Sampling Method

Once the joint distribution

p (z, c)

is obtained, the next challenge is to reduce the complexity of the decoding search and accelerate inference. To address this, we adopt an RL-based sampling method. Its core idea is to leverage reinforcement learning mechanisms to dynamically adjust sampling paths in the latent space, while introducing multiple reward functions that jointly account for fluency, sentiment orientation, topic style, and other attributes. First, we define a policy network

π_{θ} (a | z, c)

, which takes the latent vector z and target attribute c from the joint distribution p as input and outputs a probability distribution of sampling actions a, representing the probability of taking different sampling actions in the current state. The strategy network

π_{θ}

is typically a neural network that can learn how to select the optimal sampling direction based on the current latent vector and target attributes. In the second step, the policy network

π_{θ}

is used to compute a probability distribution over all possible actions in the current state. For a discrete action space, the outputs of the policy network can be converted into a probability distribution using the softmax function:

π_{θ} = s o f t m a x (N N_{θ} (z_{t}, c))

(11)

Here

N N_{θ}

denotes the policy network. We then sample a concrete action

a_{t}

from the computed probability distribution. For a discrete action space, we draw the action from a categorical distribution [31]:

a_{t} \sim C a t e g o r i c a l (π_{θ} (a | z, c))

. In the third step, the latent vector

z_{t}

is updated to a new position

z_{t + 1}

according to the chosen action

a_{t}

. The update rule is as follows:

z_{t + 1} = z_{t} + a_{t}

(12)

We then feed the updated latent vector

z_{t + 1}

into the VAE decoder to produce the corresponding text sequence

x_{g e n t, t}

, which is expressed as follows:

x_{g e n t, t} = V A E - D e c o d e r (z_{t + 1})

(13)

In the fourth step, a step reward

r_{t}

is computed based on the quality of the generated text and its degree of alignment with the desired attributes. The reward function can integrate the following aspects.

Fluency reward: It measures by the perplexity (PPL) of a language model. The lower the perplexity, the more fluent the text and the higher the reward:

r_{f l u e n c y, t} = - P P L (x_{g e n, t})

(14)

Attribute relevance reward: It quantifies how well the generated text matches the target attributes by using classifier scores. For each aspect, the formula computes the relative proportion of the target attribute’s score to the sum of scores across all attributes of that aspect, adds the logarithms of these proportions, and sums them up. A larger proportion indicates that the target attribute is more prominent within that aspect, yielding a larger log-value and a higher reward. In short, this reward function encourages the model to generate text that closely matches the target attributes across all aspects, expressed as follows:

r_{a t t a, t} = \sum_{n = 1}^{N} l o g (\frac{e x p (λ_{n} \cdot S_{a_{n}^{*}, t})}{\sum_{k = 1}^{| A_{n} |} e x p (λ_{n} \cdot S_{a_{n}^{k}, t})})

(15)

Here,

S_{a_{n}^{*}, t}

denotes the classifier score for the target attribute

a_{n}^{*}

—i.e., the attribute we want the generated text to possess—in the n-th aspect; this score indicates how likely the generated text is to possess the target attribute in the n-th aspect.

S_{a_{n}^{k}, t}

is the classifier score for the h-th attribute in the n-th aspect at time step t, reflecting the likelihood that the generated text belongs to that attribute. Meanwhile,

λ_{n}

is a parameter that balances the weights of the different attributes.

Text diversity reward: This term measures the diversity of the generated text and encourages the model to produce rich and varied content while avoiding monotonous or repetitive sentences. It is typically computed from the usage frequencies of distinct words or phrases, expressed as follows:

r_{d i v, t} = D i s t i n c t (x_{g e n, t})

(16)

Here,

r_{d i v, t}

denotes the text diversity reward at time step t, which quantifies the richness of the generated text.

D i s t i n c t (x_{g e n, t})

is a metric that quantifies the diversity of the generated text

x_{g e n, t}

; its value typically ranges from 0 to 1, with higher values indicating greater textual diversity. We obtain the overall reward by taking a weighted sum of the individual rewards described above:

r_{t} = ω_{1} \cdot r_{f l u e n c y, t} + ω_{2} \cdot r_{a t t r, t} + ω_{3} \cdot r_{d i v, t}

(17)

In the fifth step, we compute the cumulative reward

R (τ)

for the sampling trajectory

τ

, typically applying a discount factor

γ

to account for the decay of future rewards:

R (τ) = \sum_{t = 0}^{T} γ^{t} r_{t}

(18)

Here,

τ

lies in the range

0 \leq τ \leq 1

and governs the weight given to future rewards, while

r_{t}

denotes the immediate reward at time step t, measuring the quality of the current sampling step. T denotes the terminal time step of the sampling trajectory, marking the moment when the sampling process ends. We employ the REINFORCE algorithm [32] to compute the gradient of the policy network parameters:

\nabla_{θ} J (θ) = E_{γ \sim π_{θ}} [R (τ) \nabla_{θ} log π_{θ} (γ)]

(19)

\nabla_{θ} J (θ)

denotes the gradient of the policy network’s objective function

J (θ)

with respect to the parameters

θ

. The objective function

J (θ)

is typically expressed as the expected value of the cumulative reward over the sampling trajectory

τ

. By computing this gradient, we learn how to adjust the parameters

θ

so as to maximize the expected cumulative reward.

E_{γ \sim π_{θ}}

denotes the expectation over the sampling trajectory

τ

, which is generated under the current policy

π_{θ}

, the expectation reflects the average cumulative reward that can be attained under the current policy across all possible sampling trajectories. Meanwhile, the gradient of the log-probability of the policy

π_{θ} (γ)

with respect to the parameters

θ

is given by

\nabla_{θ} log π_{θ} (γ)

. The gradient of the log-probability indicates how to adjust the parameters

θ

so as to increase the likelihood of the trajectory

τ

. At its core, the algorithm uses the cumulative reward of each sampled trajectory to update the policy network parameters, thereby making trajectories that yield high cumulative rewards more probable in the future. We update the parameters

θ

of the policy network based on the calculated gradient:

θ_{t + 1} = θ_{t} + α \cdot \nabla_{θ} J (θ)

(20)

α

represents the learning rate. We repeat steps 2 to 5 until the policy network converges; i.e., the sampling policy can stably generate high-quality text that meets the target attributes.

3.5. Theoretical Analysis of the Hamilton Energy Model

The core challenge in multi-faceted controllable generation lies in balancing “multi-attribute collaborative control” with “textual fluency, diversity, and naturalness,” a requirement inherently suited to the physical properties of a Hamiltonian system. Due to its energy conservation property, the total energy

H = T (p) + V (q)

(kinetic energy T + potential energy V) of a Hamiltonian system remains conserved, ensuring that attribute control (potential energy constraints) and textual fluency, diversity, and naturalness (kinetic energy constraints) do not cancel each other out. Simultaneously, a Hamiltonian system describes state evolution via the canonical equations (

q = \frac{\partial H}{\partial p}

,

p = - \frac{\partial H}{\partial q}

). When generated text deviates from target attributes,

p = - \frac{\partial H}{\partial q}

converts the reverse attribute gradient generated by potential energy

V (q)

into momentum p update signals. By adjusting p’s evolution rate and distribution (subject to

T (p)

’s constraints on fluency, diversity, and naturalness), it corrects attribute deviations while preserving text quality. On the other hand, to address control requirements for text fluency, diversity, and naturalness, when text quality degrades (e.g., insufficient diversity or sentence fragmentation),

q = \frac{\partial H}{\partial p}

converts the quality optimization gradient generated by kinetic energy

T (p)

into an adjustment signal for position q. This guides q to optimize semantic expression within the target attribute subspace, ensuring quality improvement without deviating from attribute constraints. Finally, its high-dimensional space adaptability resolves issues such as high attribute constraint dimensions and complex interactions in multi-aspect controllable text generation. A Hamiltonian system exhibits excellent stability in high-dimensional canonical coordinate systems, with state evolution independent of coordinate system selection. This enables handling complex multi-attribute interactions without compromising attribute representation integrity, thereby avoiding the dimension catastrophe faced by traditional methods under high-dimensional constraints.

3.6. Property Decoupling Mechanism in Cartesian Coordinate Systems

The canonical coordinate system (q, p) of a Hamiltonian system provides a natural framework for attribute decoupling. The property representation function of momentum p directly corresponds to the “quality features” of text (fluency, diversity, and naturalness) in the latent space, with different quality dimensions forming independent subregions in p-space —for instance, “fluency” corresponds to a continuous evolutionary trajectory of p, while “diversity” corresponds to the semantic scope covered by p. A Hamiltonian system ensures the independence of quality features within p-space through the constraint of kinetic energy

T (p)

. q, as the “attribute driver,” specifically adjusts the weights of sentiment and topic. q enhances activation intensity in semantically relevant regions within q-space through gradient signals from potential energy

V (q)

. When an attribute meets the requirements, q automatically reduces its weight to optimize another attribute, achieving dynamic decoupling between attributes.

3.7. Correlation Mapping Between Energy and Text Attributes

In classical physical systems, kinetic energy serves as a core physical quantity describing the motion state of particles. Its inherent triple characteristics—fluidity (continuity of particle trajectories), regularity (compliance with physical laws), and variability (diversity of motion paths)—form a highly intuitive analogy with the fluency, naturalness, and diversity required in text generation tasks. Simultaneously, potential energy within physical systems establishes clear correspondences with textual semantic attributes (such as thematic orientation and emotional bias). These associations can be broken down as follows:

Fluency: The magnitude of kinetic energy directly reflects the coherence of textual sequences. When kinetic energy remains within a reasonable range, the transitions between words and sentences align with linguistic conventions, flowing as smoothly and naturally as the uniform motion of physical particles. If kinetic energy is too low, the text may suffer from logical gaps and stuttering expressions. Conversely, excessively high kinetic energy can lead to redundant sentences and disordered word order, resembling the chaotic state of particles in uncontrolled motion. Naturalness: The “regularity” dimension of kinetic energy corresponds to the naturalness of text. In physical systems, the laws governing particle motion are constrained by physical principles; in text generation, kinetic energy can constrain textual expression patterns through probability distributions (such as co-occurrence probabilities of vocabulary or syntactic compliance probabilities), ensuring generated content aligns with natural human language usage habits. For instance, the stable state of kinetic energy corresponds to the probability entropy of a text’s lexical distribution falling within a reasonable range, at which point the text’s word choice and sentence structures exhibit the natural characteristics of human language. Diversity: The “variability” dimension of kinetic energy corresponds to textual diversity. In physical systems, fluctuations in kinetic energy reflect the richness of particle trajectories; in text generation, dynamic adjustments to kinetic energy break the constraints of monotonous expression patterns, enabling models to exhibit diverse characteristics in lexical selection, sentence construction, and semantic expression. For instance, when kinetic energy fluctuates within reasonable bounds, generated text avoids monotony from excessive uniformity while preventing chaos from excessive randomness, thereby achieving a balance between content diversity and readability. Semantic Properties: Potential energy in physical systems is determined by the position or configuration of particles, reflecting the system’s inherent constraints and target state. Analogously, in text generation, potential energy corresponds to the semantic properties of the text. Its value is determined by predefined semantic control conditions (such as topic, sentiment, and key information): when the generated text’s semantics align with the control objectives, potential energy resides in a low, stable state; when semantics deviate from control targets, potential energy increases, forming a constraint feedback loop. This mechanism mirrors particles in physical systems being drawn toward stable positions by potential energy fields, ensuring generated text consistently adheres to preset semantic requirements and achieving precise control over multiple attributes.

4. Experiment

In this section, we conduct an automatic evaluation of our method, demonstrating its effectiveness. We further provide an in-depth analysis of the reasons for its superiority and present visualizations of the results.

4.1. Dataset

Our experiment involved simultaneous control along two dimensions: topic and sentiment. For the topic dimension, we used the AGNews dataset, which comprises four topics: World, Sports, Business, and Technology. For the sentiment dimension, we adopted the IMDb movie review dataset, which contains positive and negative sentiments. We randomly sampled 20 k sentences from each dataset for each attribute as in [29] to train our method. We then selected the same 15 attribute-agnostic prompts and instructed the model to generate 50 sentences with the desired attribute for each prompt.

4.2. Experimental Environment and Parameter Settings

All experiments ran on an NVIDIA A800 (80 GB) GPU (From the Autodl Computing Cloud Platform)and a Intel 14-core Intel Xeon Gold 6348 CPU (From the Autodl Computing Cloud Platform). MacHa’s VAE encoder–decoder pair is initialized with BERT, and the latent dimension is fixed at 128. To align with MacLaSa, we use AdamW (lr = 8

\times 10^{- 5}

) and train for 50 epochs. We set the weights

λ_{1}

,

λ_{2}

, and

λ_{3}

in the latent attribute space construction to 1, while manually tuning the weight

ϵ_{n}

in the Hamiltonian energy model to balance the different attributes. For the decoding component, we set the weights

ω_{1}

,

ω_{2}

, and

ω_{3}

to 0.5 and the learning rate to 1.

4.3. Baseline Method

We compared eight baseline models. PPLM [33] combines a pre-trained language model with one or more attribute classifiers, steering generation toward desired attributes by updating the model’s hidden representations. DE-XPERTSM [34] a pre-trained language model with an “expert” model and an “anti-expert” model. The expert models the desired textual attribute, while the anti-expert models the undesired attribute. During decoding, a token receives a high probability only if it is deemed likely by the expert and unlikely by the anti-expert. MaRCo [35] employs a Product-of-Experts (PoE) approach with autoregressive language models, using an LM-expert and an LM-anti-expert to identify and replace potentially toxic tokens, thereby rewriting text into a less harmful form. Mix and Match [36] is a sampler extraction method built on an energy-based model. It defines an energy function to quantify sample quality and employs Metropolis–Hastings sampling to iteratively generate low-energy, high-quality samples. Multiple attribute discriminators evaluate different attributes of each sample, and their outputs are incorporated into the energy function so that the sampling process is steered toward producing samples with the desired attributes. Contrastive Prefixes [37] trains a set of attribute-specific prefixes to steer large language models toward generating natural language that exhibits the desired attributes, while also modeling inter-prefix relationships to enable both single-attribute and multi-attribute control. LatentOps [38] connects a pre-trained language model to a latent space and samples text vectors within it via an ODE-based method, subsequently decoding them into the desired text sequences, thereby enabling efficient and composable text manipulation.

Distribution first estimates the attribute space with an autoencoder, then iteratively narrows the distance to different attribute points to locate the intersection of multiple attribute distributions, and finally maps this intersection into attribute-relevant sentences via a prefix-tuned decoder. MacLaSa achieves multi-aspect controllable text generation by estimating a compact latent space for multiple attributes and employing a fast ODE-based sampler.

4.4. Evaluation Criteria

We adopt the same three evaluation metrics used in MacLaSa to assess performance on the two controlled generation tasks. Correctness: To ensure consistent variables and facilitate fair comparison with MacLaSa and other baselines, we employed the fine-tuned sentiment and topic classifiers released in [39], using them to compute the proportion of generated sentences that possess the desired attribute. Perplexity (PPL): This metric reflects how “perplexed” a model is by a piece of text; the lower the value, the more accurately the model predicts the text. It is computed from the model’s predicted probability of each word in the sequence. We use GPT-2 to calculate the perplexity scores of the generated text. Distinctness: This metric assesses textual diversity by computing the ratio of unique n-grams to the total number of n-grams in the generated text. In our experiments, we report Distinctness-1 and Distinctness-2. We have also incorporated human evaluation metrics to further validate the superiority of our approach.

4.5. Comparison of Experimental Results and Analysis

As shown in Table 1, we compare our approach against the baselines described in Section 4.3. Our experiment covers two tasks with a total of eight attribute combinations, and we report the average score across these eight combinations. To directly highlight the advantages of our method, we fully reproduce the experimental results from MacLaSa [29]. Table 2 presents the detailed performance of our method versus the other baselines across all eight attribute combinations. Experimental results for all methods except the MacHa method are sourced from [29]. PPLM, DEXPERTS, and MaRCo perform well on certain combinations but achieve low accuracy on others, yielding a poor average attribute accuracy score; moreover, the texts they produce exhibit mediocre language quality and high perplexity. We hypothesize that PPLM steers generation toward desired attributes via gradient-based updates, but the gradient signals can become unstable or imprecise under different attribute combinations. For some combinations, the gradients may fail to effectively guide the model toward attribute-compliant text, causing a drop in attribute accuracy. Moreover, PPLM’s optimization objective prioritizes attribute control while paying comparatively little attention to text quality optimization. For certain combinations, DEXPERTS may over-rely on the expert model in order to boost attribute accuracy, which degrades text quality; for other combinations, it may fail to sufficiently integrate expert guidance, resulting in low attribute accuracy. MaRCo controls attributes and steers generation by masking tokens, yet for some attribute combinations these masked tokens may exhibit only weak correlations with the target attributes, preventing the model from effectively guiding generation toward attribute-compliant text. Moreover, its masking strategy lacks precision, which leads to poorer overall text quality. Although Mix and Match and Contrastive Prefixes achieve relatively high attribute accuracy, their text quality is poor and their perplexity is high. We conjecture that Mix and Match constructs an energy model by integrating multiple attribute discriminators; however, this model may inadequately capture the complex structure and semantic nuances of text, yielding outputs with poor fluency and naturalness. In addition, its Metropolis—Hastings sampler, operating in the high-dimensional text space, is prone to local optima and struggles to locate high-quality textual solutions. Contrastive Prefixes, on the other hand, independently trains a separate prefix for each attribute without any coordination among them. When multiple attribute prefixes are combined, their individually guided semantic directions can conflict, undermining textual coherence and yielding low-quality generations. Although LatentOps and Distribution achieve a good balance between language quality and attribute accuracy, they suffer from low textual diversity; the Distinct metrics reveal a high degree of repetition in their generated outputs. MacLaSa shows a clear improvement in the diversity metric, yet it is still far from optimal. We hypothesize that LatentOps may not fully explore all regions of the latent space, causing the generated text to cluster in certain local areas and resulting in a lack of diversity. The Distribution method generates text by searching for the intersection regions of multiple attribute distributions in latent space. However, these intersection regions are relatively limited, causing the model to discover only a small number of textual patterns that satisfy the desired attributes. Consequently, the generated text exhibits highly repetitive phrasing and lacks diversity. MacLaSa employs an energy-based model (EBM) whose single energy function cannot adequately capture textual diversity. Different attribute combinations call for distinct energy functions to guide generation, yet the method does not design specialized energy functions for each combination, thereby limiting the diversity of the generated text. Moreover, the trajectories of its ODE sampler are overly dependent on the initial conditions and the form of the differential equations, which in turn restricts the diversity of the generated text.

Table 1. Evaluation metrics results for the MacHa method and other baseline methods.

Table 2. Detailed results for each baseline method across all eight combinations (two sentiment attributes × four topic attributes).

Compared with the strongest baseline, MacHa achieves higher accuracy while keeping perplexity stable, demonstrating its superiority in multi-aspect controllable generation. Moreover, unlike the baselines above, MacHa not only balances text quality and attribute accuracy but also significantly improves textual diversity. MacHa reduces decoding-time inference to 0.05 s. In summary, compared with existing baselines, MacHa delivers substantial improvements in text quality, attribute accuracy, generation diversity, and inference speed. Table 3 presents the human evaluations results for the multi-dimensional control task. Inter-rater agreement achieved a Fleiss’

κ

coefficient of 0.32. All baseline model comparison results are cited from [28]. The experimental findings demonstrate that our method outperforms other baseline approaches in terms of correctness, fluency, and diversity. This confirms that our method delivers superior performance in both automated and human evaluations.

Table 3. Results of human evaluations.

4.6. Ablation Analysis

To further analyze MacHa, we conducted an ablation study on the components described in Section 3.2, evaluating the reconstruction–regularization loss

L_{E}

, the classification loss

L_{C}

, and the aspect-difference loss

L_{D}

. As shown in Table 1, removing

L_{E}

prevents the model from accurately generating outputs that resemble the input data x from the latent variable z. This loss is responsible for ensuring faithful reconstruction, i.e., guaranteeing that the decoder can produce samples close to the original data given the latent variables. Once it is removed, the model loses its constraint on generation quality, fails to learn an effective generative mechanism, and consequently cannot produce samples that meet the required standards. When

L_{C}

is removed, the model can no longer reliably distinguish the representations of different attributes within the same aspect. This arises because the loss normally enforces classification accuracy; without it, the model lacks the constraint needed to learn clear boundaries and feature differences among attributes. Consequently, the model cannot accurately steer the attribute orientation of the generated text, resulting in outputs that deviate from—or even contradict—the intended attributes. When

L_{D}

is removed, the model is no longer able to effectively reduce the distributional discrepancies between data from different aspects in the latent space, leading to a drop in attribute accuracy scores. This occurs because

L_{D}

reduces discrepancies by minimizing the summed distances between the centroids of different aspect distributions. Without this loss, the distances between these centroids in the latent space remain uncontrolled, causing the distributional divergence to increase. In addition, we conducted an ablation study on the method in Section 3.4 to demonstrate the superiority of our sampling strategy. Using the same VAE parameters as in MacLaSa, we compared all sampling methods mentioned in that work, including random sampling, Langevin-dynamics-based decoding (LD), and ODE-based sampling. As shown in Table 4, our analysis reveals that random sampling directly draws from the latent space without considering the constraints imposed by the target attributes, thereby producing undesired attribute combinations. Moreover, it fails to effectively explore the regions of the latent space relevant to the target attributes, frequently yielding low-quality or irrelevant text. LD is highly sensitive to hyper-parameter choices; unstable settings can cause the sampling process to diverge or converge slowly, degrading text quality. Moreover, it shows poor attribute alignment—the generated text is insufficiently correlated with the target attributes, resulting in lower attribute accuracy. ODE-based sampling trajectories are overly dependent on the initial conditions and the particular form of the differential equations. This dependence confines exploration to specific paths in the latent space, preventing broader coverage and thereby restricting the diversity of the generated text. In contrast, MacHa’s RL-based sampler is not tied to a fixed differential equation or initial condition. It flexibly explores diverse trajectories in the latent space, thereby broadening the diversity of generated text. Furthermore, the sampler dynamically adjusts its strategy according to reinforcement learning reward signals, allowing it to better adapt to different attribute combinations and generation objectives.

Table 4. Evaluation results of different sampling methods.

4.7. Visualization of Potential Space

To provide an intuitive view of the learned latent space, we visualized a subset of it using t-SNE [40], covering four attributes: positive, negative, business, and science. As shown in Figure 2, the classification loss clearly separates attributes within the same aspect—e.g., positive vs. negative and business vs. science—into two opposing clusters that do not interfere with each other, which facilitates the subsequent Hamiltonian energy model in distinguishing mutually exclusive attributes. Thanks to the aspect-difference loss, attributes from the two aspects—sentiment and topic—lie close to one another, which helps the model generate high-quality text in subsequent steps. We also observe that the positive and science attributes overlap more than others, a pattern we attribute to the continuous progress of science and technology, which generally yields positive impacts.

Figure 2. t-SNE projection in latent space.

4.8. Verification of Explainability Based on Causal Inference Methods

The “Average Causal Effect (ACE)” detection method proposed by Qiu et al. in their study on causal intervention in large language models provides a quantitative validation tool for the interpretability of our approach. This research confirms that causal manipulation can achieve controllability and interpretability in generated content. Drawing from this approach, we selected the affect weight (

α

) and topic weight (

β

) of the Hamilton potential term in MacHa as intervention variables. Using the affect accuracy (

S_{acc}

) and topic accuracy (

T_{acc}

) of generated text as outcome variables, we calculated ACE values after controlling for irrelevant variables such as kinetic energy weight: results show that

α

exhibits an ACE value of 0.32 (p < 0.001) for

S_{acc}

, while

β

yields an ACE value of 0.28 (p < 0.001) for

T_{acc}

, indicating significant causal relationships between Hamilton parameters and attribute accuracy. Conversely, ACE values for

α

for

T_{acc}

and

β

on

S_{acc}

were both <0.03 and p > 0.05, demonstrating the effectiveness of the decoupling mechanism between kinetic and potential energy in preventing causal interference among attributes.

5. Summary and Conclusions

5.1. Limitations

Our method trains a compact latent space with multiple loss functions and employs a VAE to map between the attribute latent space and sentence attributes. Nevertheless, this intricate latent structure can impede full convergence, particularly on large-scale datasets or when modeling complex textual attributes. The added complexity may introduce noise and redundant information, compromising both model stability and generation quality. Moreover, our approach uses the kinetic and potential energies of the Hamiltonian energy model to control the fluency and attribute orientation of text generation. However, this balance relies on predefined energy functions and weight parameters, which may be difficult to tune precisely for varying text generation demands. For instance, in some tasks fluency may outweigh attribute adherence, whereas in others the reverse may be true. The current energy model design lacks the flexibility to dynamically rebalance this trade-off. Moreover, when applied to large language models with extremely large parameter scales, this method still has room for improvement in both training efficiency and performance.

5.2. Conclusions and Future Work

In this paper, we propose a novel method—MacHa—for multi-aspect controllable text generation. MacHa first trains a compact latent space under three sets of constraints and then introduces a Hamiltonian energy model that leverages the energy-conserving and stable dynamics of a Hamiltonian system to balance multiple attributes within the latent space. Finally, an RL-based sampling strategy eliminates the need for complex search during decoding and markedly accelerates inference. We evaluate MacHa on eight attribute combinations spanning two control dimensions, and extensive experiments confirm its effectiveness. We also provide in-depth ablation studies and a visualization of the learned latent space. Given the limitations outlined in Section 5.1, we propose the following directions for future research: First, explore adaptive energy weight tuning methods, such as incorporating reinforcement learning or adaptive optimization algorithms. This would enable the model to dynamically adjust the weights of energy terms—including fluency and semantic constraints—based on input text attributes (e.g., topic, sentiment, and length). Such an approach would further enhance the precision and flexibility of multi-attribute control. Second, optimizing training protocols for large language models can involve exploring lightweight training methods (such as parameter-efficient fine-tuning and knowledge distillation) or designing specialized pre-training–fine-tuning paradigms. This enables MacHa to better adapt to the architectural characteristics of large language models, enhancing training and inference efficiency while maintaining generation quality. Finally, we hope this approach will bring value to more practitioners working in the field of controlled text generation.

Author Contributions

Conceptualization, D.X.; Methodology, D.X.; Software, Y.W.; Formal analysis, D.X.; Resources, M.L.; Data curation, Y.W.; Writing—original draft, D.X.; Writing—review & editing, M.L.; Supervision, M.L.; Funding acquisition, M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant number 62266033, and the Open Project of the Key Laboratory of Infinite-Dimensional Hamiltonian Systems and Their Algorithmic Applications (Ministry of Education) at Inner Mongolia Normal University under grant number 2023KFZD03.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (62266033). We gratefully acknowledge the supported by the Key Laboratory of Infinite-dimensional a Hamiltonian System and Its Algorithm Application (Inner Mongolia Normal University, Ministry of Education) (2023KFZD03).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Pei, J.; Yang, K.; Klein, D. PREADD: Prefix-adaptive decoding for controlled text generation. arXiv 2023, arXiv:2307.03214. [Google Scholar] [CrossRef]
Ma, C.; Zhao, T.; Shing, M.; Sawada, K.; Okumura, M. Focused prefix tuning for controllable text generation. J. Nat. Lang. Process. 2024, 31, 250–265. [Google Scholar] [CrossRef]
Kumar, S.; Malmi, E.; Severyn, A.; Tsvetkov, Y. Controlled text generation as continuous optimization with multiple constraints. Adv. Neural Inf. Process. Syst. 2021, 34, 14542–14554. [Google Scholar]
Huang, X.; Liu, Z.; Li, P.; Li, T.; Sun, M.; Liu, Y. An extensible plug-and-play method for multi-aspect controllable text generation. arXiv 2022, arXiv:2212.09387. [Google Scholar]
Becker, J.; Wahle, J.P.; Gipp, B.; Ruas, T. Text generation: A systematic literature review of tasks, evaluation, and challenges. arXiv 2024, arXiv:2405.15604. [Google Scholar] [CrossRef]
Xie, Z. Neural text generation: A practical guide. arXiv 2017, arXiv:1711.09534. [Google Scholar] [CrossRef]
Liu, X.; Zheng, Y.; Du, Z.; Ding, M.; Qian, Y.; Yang, Z.; Tang, J. GPT understands, too. AI Open 2024, 5, 208–215. [Google Scholar] [CrossRef]
Zhu, C.; Liu, Y.; Lyu, C.; Yang, X.; Chen, G.; Wang, L.; Luo, W.; Zhang, K. Towards Lightweight, Adaptive and Attribute-Aware Multi-Aspect Controllable Text Generation with Large Language Models. arXiv 2025, arXiv:2502.13474. [Google Scholar]
Konen, K.; Jentzsch, S.; Diallo, D.; Schütt, P.; Bensch, O.; Baff, R.E.; Opitz, D.; Hecking, T. Style vectors for steering generative large language model. arXiv 2024, arXiv:2402.01618. [Google Scholar] [CrossRef]
Liu, S.; Ye, H.; Xing, L.; Zou, J. In-context vectors: Making in context learning more effective and controllable through latent space steering. arXiv 2023, arXiv:2311.06668. [Google Scholar]
Krause, B.; Gotmare, A.D.; McCann, B.; Keskar, N.S.; Joty, S.; Socher, R.; Rajani, N.F. Gedi: Generative discriminator guided sequence generation. arXiv 2020, arXiv:2009.06367. [Google Scholar] [CrossRef]
Yang, K.; Klein, D. FUDGE: Controlled text generation with future discriminators. arXiv 2021, arXiv:2104.05218. [Google Scholar] [CrossRef]
Sitdikov, A.; Balagansky, N.; Gavrilov, D.; Markov, A. Classifiers are better experts for controllable text generation. arXiv 2022, arXiv:2205.07276. [Google Scholar] [CrossRef]
Mudgal, S.; Lee, J.; Ganapathy, H.; Li, Y.; Wang, T.; Huang, Y.; Chen, Z.; Cheng, H.T.; Collins, M.; Strohman, T.; et al. Controlled decoding from language models. arXiv 2023, arXiv:2310.17022. [Google Scholar]
Zhong, Q.; Ding, L.; Liu, J.; Du, B.; Tao, D. ROSE doesn’t do that: Boosting the safety of instruction-tuned large language models with reverse prompt contrastive decoding. arXiv 2024, arXiv:2402.11889. [Google Scholar]
Choi, D.; Kim, J.; Gim, M.; Lee, J.; Kang, J. DeepClair: Utilizing Market Forecasts for Effective Portfolio Selection. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2024; pp. 4414–4422. [Google Scholar]
Maas, A.; Daly, R.E.; Pham, P.T.; Huang, D.; Ng, A.Y.; Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; pp. 142–150. [Google Scholar]
Zhang, X.; Zhao, J.; LeCun, Y. Character-level convolutional networks for text classification. Adv. Neural Inf. Process. Syst. 2015, 28, 649–657. [Google Scholar]
Ajwani, R.D.; Zhu, Z.; Rose, J.; Rudzicz, F. Plug and Play with Prompts: A Prompt Tuning Approach for Controlling Text Generation. arXiv 2024, arXiv:2404.05143. [Google Scholar] [CrossRef]
Li, W.; Wei, W.; Xu, K.; Xie, W.; Chen, D.; Cheng, Y. Reinforcement learning with token-level feedback for controllable text generation. arXiv 2024, arXiv:2403.11558. [Google Scholar] [CrossRef]
He, X. Parallel refinements for lexically constrained text generation with bart. arXiv 2021, arXiv:2109.12487. [Google Scholar] [CrossRef]
Zhang, H.; Wang, X.; Li, C.; Ao, X.; He, Q. Controlling large language models through concept activation vectors. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 25851–25859. [Google Scholar]
Liu, Y.; Liu, X.; Zhu, X.; Hu, W. Multi-aspect controllable text generation with disentangled counterfactual augmentation. arXiv 2024, arXiv:2405.19958. [Google Scholar] [CrossRef]
Huang, Y.; Chen, D.; Umrawal, A.K. JAM: Controllable and Responsible Text Generation via Causal Reasoning and Latent Vector Manipulation. arXiv 2025, arXiv:2502.20684. [Google Scholar] [CrossRef]
Chen, Z.; Feng, M.; Yan, J.; Zha, H. Learning neural Hamiltonian dynamics: A methodological overview. arXiv 2022, arXiv:2203.00128. [Google Scholar] [CrossRef]
Zaremba, W.; Sutskever, I.; Vinyals, O. Recurrent neural network regularization. arXiv 2014, arXiv:1409.2329. [Google Scholar]
Lee, D.D.; Pham, P.; Largman, Y.; Ng, A. Advances in neural information processing systems 22. Tech. Rep. 2009. [Google Scholar]
Zhang, J.; Zhu, Q.; Lin, W. Learning hamiltonian neural koopman operator and simultaneously sustaining and discovering conservation laws. Phys. Rev. Res. 2024, 6, L012031. [Google Scholar] [CrossRef]
Ding, H.; Pang, L.; Wei, Z.; Shen, H.; Cheng, X.; Chua, T.S. Maclasa: Multi-aspect controllable text generation via efficient sampling from compact latent space. arXiv 2023, arXiv:2305.12785. [Google Scholar]
Gu, Y.; Feng, X.; Ma, S.; Zhang, L.; Gong, H.; Qin, B. A distributional lens for multi-aspect controllable text generation. arXiv 2022, arXiv:2210.02889. [Google Scholar] [CrossRef]
De Smet, L.; Sansone, E.; Zuidberg Dos Martires, P. Differentiable sampling of categorical distributions using the catlog-derivative trick. Adv. Neural Inf. Process. Syst. 2023, 36, 30416–30428. [Google Scholar]
Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 1992, 8, 229–256. [Google Scholar] [CrossRef]
Dathathri, S.; Madotto, A.; Lan, J.; Hung, J.; Frank, E.; Molino, P.; Yosinski, J.; Liu, R. Plug and play language models: A simple approach to controlled text generation. arXiv 2019, arXiv:1912.02164. [Google Scholar]
Liu, A.; Sap, M.; Lu, X.; Swayamdipta, S.; Bhagavatula, C.; Smith, N.A.; Choi, Y. DExperts: Decoding-time controlled text generation with experts and anti-experts. arXiv 2021, arXiv:2105.03023. [Google Scholar] [CrossRef]
Hallinan, S.; Liu, A.; Choi, Y.; Sap, M. Detoxifying text with marco: Controllable revision with experts and anti-experts. arXiv 2022, arXiv:2212.10543. [Google Scholar]
Mireshghallah, F.; Goyal, K.; Berg-Kirkpatrick, T. Mix and match: Learning-free controllable text generation using energy language models. arXiv 2022, arXiv:2203.13299. [Google Scholar] [CrossRef]
Qian, J.; Dong, L.; Shen, Y.; Wei, F.; Chen, W. Controllable natural language generation with contrastive prefixes. arXiv 2022, arXiv:2202.13257. [Google Scholar] [CrossRef]
Liu, G.; Feng, Z.; Gao, Y.; Yang, Z.; Liang, X.; Bao, J.; He, X.; Cui, S.; Li, Z.; Hu, Z. Composable text controls in latent space with ODEs. arXiv 2022, arXiv:2208.00638. [Google Scholar]
Liu, Z.; Lin, W.; Shi, Y.; Zhao, J. A robustly optimized BERT pre-training approach with post-training. In Proceedings of the China National Conference on Chinese Computational Linguistics, Hohhot, China, 13–15 August 2021; pp. 471–484. [Google Scholar]
Maaten, L.v.d.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]

Figure 1. MacHa method flowchart: First, map the text through the VAE encoder and optimize the latent space distribution. Then, construct a Hamiltonian energy model to control the multi-dimensional characteristics of text generation. Finally, generate the target text sequence using reinforcement learning sampling and the VAE decoder.

Figure 2. t-SNE projection in latent space.

Table 1. Evaluation metrics results for the MacHa method and other baseline methods.

Method	Correctness (%) Senti and Topic ↑	Text Fluency PPL ↓	Diversity Dist-1 ↑	Diversity Dist-2 ↑	Efficiency Time (s)
PPLM	18.14 ± 0.45	25.59 ± 1.09	0.23	0.64	40.56
DEXPERTS	23.93 ± 1.11	38.70 ± 2.51	0.23	0.70	0.64
MaRCo	27.81 ± 1.94	18.87 ± 1.85	0.18	0.58	0.40
Mix and Match	50.17 ± 2.07	68.72 ± 0.97	0.36	0.84	164.60
Contrastive	53.02 ± 1.52	52.56 ± 11.97	0.22	0.71	0.59
LatentOps	44.41 ± 5.72	26.11 ± 1.46	0.16	0.55	0.10
Distribution	49.79 ± 1.99	12.48 ± 0.52	0.08	0.28	0.04
MacLaSa	59.18 ± 0.18	28.19 ± 1.26	0.16	0.60	0.10
MacHa	63.59 ± 2.64	27.76 ± 2.31	0.28	0.74	0.05
w/o $L_{C}$	49.32 ± 8.81	28.53 ± 1.43	0.20	0.68	0.05
w/o $L_{D}$	53.97 ± 4.25	27.37 ± 1.28	0.25	0.73	0.05
w/o $L_{E}$	55.28 ± 3.49	29.65 ± 0.77	0.22	0.71	0.05

Table 2. Detailed results for each baseline method across all eight combinations (two sentiment attributes × four topic attributes).

Method	Correctness (%) ↑	Text Quality PPL ↓	Diversity Distinct-1 ↑	Diversity Distinct-2 ↑	Senti and Topic AccT ↑
PPLM	Positive-Word	20.36 ± 1.69	25.47 ± 1.70	0.23	0.64
	Positive-Sport	16.53 ± 1.13	25.78 ± 1.30	0.23	0.63
	Positive-Business	25.24 ± 2.96	26.66 ± 1.66	0.24	0.64
	Positive-Sci/Tech	61.73 ± 0.66	25.06 ± 1.53	0.24	0.66
	Negative-Word	3.87 ± 1.99	25.27 ± 1.23	0.23	0.64
	Negative-Sport	2.27 ± 0.57	25.96 ± 1.54	0.23	0.63
	Negative-Business	1.78 ± 1.26	26.11 ± 1.20	0.23	0.64
	Negative-Sci/Tech	13.29 ± 1.82	24.40 ± 1.11	0.24	0.66
	Average	18.14 ± 0.45	25.59 ± 1.09	0.23	0.64
DEXPERTS	Positive-Word	34.22 ± 4.24	37.36 ± 3.46	0.24	0.72
	Positive-Sport	8.40 ± 2.66	37.36 ± 3.46	0.24	0.72
	Positive-Business	10.98 ± 1.67	37.36 ± 3.46	0.24	0.72
	Positive-Sci/Tech	45.02 ± 4.31	37.36 ± 3.46	0.24	0.72
	Negative-Word	9.47 ± 2.68	40.03 ± 2.35	0.21	0.68
	Negative-Sport	8.17 ± 2.27	40.03 ± 2.35	0.21	0.68
	Negative-Business	10.98 ± 1.50	40.03 ± 2.35	0.21	0.68
	Negative-Sci/Tech	63.64 ± 8.73	40.03 ± 2.35	0.21	0.68
	Average	23.93 ± 1.11	38.70 ± 2.51	0.23	0.70
MaRCo	Positive-Word	36.22 ± 8.04	17.13 ± 1.51	0.18	0.57
	Positive-Sport	37.11 ± 25.23	18.16 ± 1.47	0.17	0.55
	Positive-Business	38.89 ± 8.34	19.43 ± 2.13	0.19	0.59
	Positive-Sci/Tech	50.00 ± 5.21	17.91 ± 1.39	0.18	0.57
	Negative-Word	8.22 ± 4.91	18.79 ± 1.88	0.19	0.59
	Negative-Sport	10.89 ± 0.38	19.94 ± 2.85	0.17	0.57
	Negative-Business	22.89 ± 20.07	20.51 ± 2.45	0.19	0.59
	Negative-Sci/Tech	18.22 ± 5.39	19.06 ± 1.91	0.18	0.59
	Average	27.81 ± 1.94	18.87 ± 1.85	0.18	0.58
Mix and Match	Positive-Word	58.89 ± 0.83	61.27 ± 15.74	0.36	0.84
	Positive-Sport	70.31 ± 5.55	66.58 ± 8.74	0.35	0.84
	Positive-Business	39.78 ± 1.66	65.89 ± 14.35	0.35	0.84
	Positive-Sci/Tech	65.33 ± 2.49	69.07 ± 12.27	0.36	0.84
	Negative-Word	41.55 ± 1.66	69.49 ± 15.52	0.35	0.84
	Negative-Sport	47.33 ± 8.13	72.72 ± 8.87	0.36	0.84
	Negative-Business	31.56 ± 5.15	71.61 ± 15.14	0.35	0.84
	Negative-Sci/Tech	58.00 ± 4.75	73.08 ± 9.81	0.37	0.84
	Average	50.17 ± 2.07	68.72 ± 11.97	0.36	0.84
Contrastive	Positive-Word	67.87 ± 1.13	48.15 ± 15.74	0.23	0.72
	Positive-Sport	70.31 ± 5.55	52.36 ± 8.74	0.21	0.70
	Positive-Business	53.16 ± 5.00	56.13 ± 14.35	0.22	0.72
	Positive-Sci/Tech	51.96 ± 3.09	45.03 ± 12.27	0.23	0.71
	Negative-Word	40.94 ± 4.26	51.27 ± 15.52	0.22	0.70
	Negative-Sport	40.71 ± 10.65	59.77 ± 8.87	0.21	0.71
	Negative-Business	48.84 ± 6.95	61.91 ± 15.14	0.20	0.70
	Negative-Sci/Tech	50.40 ± 3.95	45.86 ± 9.81	0.23	0.71
	Average	53.02 ± 1.52	52.56 ± 11.97	0.22	0.71
LatenOps	Positive-Word	57.96 ± 5.07	24.79 ± 3.34	0.17	0.56
	Positive-Sport	63.47 ± 11.01	28.01 ± 1.80	0.16	0.55
	Positive-Business	61.73 ± 9.36	25.73 ± 1.84	0.14	0.52
	Positive-Sci/Tech	39.64 ± 22.07	26.49 ± 1.73	0.17	0.55
	Negative-Word	34.62 ± 1.59	24.98 ± 1.56	0.16	0.55
	Negative-Sport	40.41 ± 9.72	25.14 ± 1.48	0.14	0.52
	Negative-Business	25.74 ± 2.41	27.30 ± 2.11	0.15	0.54
	Negative-Sci/Tech	31.56 ± 2.53	26.49 ± 0.99	0.16	0.57
	Average	44.41 ± 5.72	26.11 ± 1.46	0.16	0.55
Distribution	Positive-Word	37.42 ± 4.38	13.34 ± 0.13	0.09	0.30
	Positive-Sport	71.60 ± 4.39	14.67 ± 0.53	0.09	0.29
	Positive-Business	72.80 ± 6.45	11.23 ± 1.00	0.07	0.25
	Positive-Sci/Tech	72.80 ± 11.07	12.41 ± 0.64	0.08	0.28
	Negative-Word	46.80 ± 10.89	11.89 ± 1.12	0.07	0.28
	Negative-Sport	35.91 ± 7.84	12.99 ± 0.57	0.08	0.28
	Negative-Business	26.09 ± 5.60	11.03 ± 0.11	0.07	0.25
	Negative-Sci/Tech	34.86 ± 6.25	12.25 ± 0.93	0.08	0.27
	Average	49.79 ± 1.99	12.48 ± 0.52	0.08	0.28
MacLaSa	Positive-Word	59.47 ± 6.66	26.26 ± 0.20	0.19	0.65
	Positive-Sport	87.93 ± 4.20	28.69 ± 1.78	0.16	0.57
	Positive-Business	82.87 ± 3.27	27.67 ± 1.55	0.15	0.57
	Positive-Sci/Tech	76.34 ± 0.46	28.77 ± 2.03	0.16	0.60
	Negative-Word	56.54 ± 1.47	26.28 ± 1.26	0.16	0.59
	Negative-Sport	38.00 ± 2.67	32.23 ± 0.20	0.17	0.61
	Negative-Business	31.40 ± 4.07	29.06 ± 1.12	0.15	0.59
	Negative-Sci/Tech	44.74 ± 0.34	31.95 ± 0.48	0.17	0.62
	Average	59.18 ± 0.81	28.19 ± 1.26	0.16	0.60
MacHa	Positive-Word	64.87 ± 1.96	25.83 ± 0.39	0.27	0.74
	Positive-Sport	89.16 ± 2.86	27.63 ± 2.96	0.27	0.75
	Positive-Business	83.61 ± 1.57	26.50 ± 0.45	0.30	0.77
	Positive-Sci/Tech	88.29 ± 4.51	25.37 ± 2.57	0.28	0.73
	Negative-Word	58.49 ± 0.97	25.49 ± 0.63	0.28	0.72
	Negative-Sport	41.27 ± 1.39	31.58 ± 1.67	0.28	0.73
	Negative-Business	32.55 ± 0.87	28.81 ± 1.21	0.29	0.76
	Negative-Sci/Tech	50.53 ± 2.38	30.92 ± 1.64	0.27	0.74
	Average	63.59 ± 2.64	27.76 ± 2.31	0.28	0.74

Table 3. Results of human evaluations.

Method	Correctness	Text Fluency	Efficiency
PPLM	1.96	2.67	2.54
DEXPERTS	1.98	2.38	1.88
MaRCo	2.08	2.78	2.65
Mix and Match	1.21	1.38	2.13
Contrastive	2.04	2.29	2.38
LatentOps	2.21	2.21	2.38
Distribution	2.67	2.67	2.63
MacLaSa	3.54	3.25	2.96
MacHa	4.02	3.86	3.25

Table 4. Evaluation results of different sampling methods.

Method	Correctness (Senti and Topic) ↑	Text Quality (PPL) ↓
Random	16.58	33.57
LD	31.47	10.46
ODE	59.23	28.91
RL	72.86	25.48

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Article Metrics

Citations

Article Access Statistics

Journal Statistics

Article metric data becomes available approximately 24 hours after publication online.