Applied Sciences
  • Article
  • Open Access

19 December 2024

Bayesian Optimization for Instruction Generation

1 Department of Computer Science Systems and Communication, University of Milano-Bicocca, 20126 Milano, Italy
2 Department of Economics Management and Statistics, University of Milano-Bicocca, 20126 Milano, Italy
3 OAKS s.r.l., 20125 Milan, Italy
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Advances in Large Language Models: Techniques, Applications and Challenges

Abstract

The performance of Large Language Models (LLMs) strongly depends on the selection of the best instructions for different downstream tasks, especially in the case of black-box LLMs. This study introduces BOInG (Bayesian Optimization for Instruction Generation), a method leveraging Bayesian Optimization (BO) to efficiently generate instructions while addressing the combinatorial nature of instruction search. Over the last decade, BO has emerged as a highly effective optimization method in various domains due to its flexibility and sample efficiency. At its core, BOInG employs Bayesian search in a low-dimensional continuous space, projecting solutions into a high-dimensional token embedding space to retrieve discrete tokens. These tokens act as seeds for the generation of human-readable, task-relevant instructions. Experimental results demonstrate that BOInG achieves comparable or superior performance to state-of-the-art methods, such as InstructZero and Instinct, with substantially lower resource requirements while also enabling the use of both white-box and black-box models. This approach offers both theoretical and practical benefits without requiring specialized hardware.

1. Introduction

Large Language Models (LLMs) have triggered a remarkable amount of innovation across the machine learning arena. In this paper, we focus on the design of experiments and black-box optimization, areas that have not been extensively explored to date despite the many opportunities presented by the integration of LLMs and optimization. Foundational language models can be game changers in optimization: by leveraging the enormous amount of information available in free-form text, they enable an entirely new approach to optimization task comprehension, exploiting wider contexts across new tasks and generalizing pre-trained models over unseen search spaces.
The considerations outlined in the Introduction were inspired by [1], which argues the potential of foundational models for enhancing black-box optimization and advocates for the adoption of transformers and LLMs to achieve this goal.
A general approach is proposed in [2] that introduces LLMs as optimizers, describing the optimization task through natural language. In each optimization step, the LLM generates new solutions from the prompts that contain the previously generated solutions with their values. The first problems considered are linear regression and the travelling salesman problem, with prompt optimization aimed at finding instructions that maximize the task-specific objective function. In [3], it is demonstrated that, using textual representations of mathematical values, LLMs can act as universal regressors. The proposed method, namely OmniPred, can take as input dynamically varying input spaces and does not require normalization. LLMs can also be used in the framework of evolutionary optimization [4] for single-objective and multi-objective evolutionary optimization [5,6].
Another computational framework for exploiting LLMs in black-box optimization is Bayesian Optimization (BO) [7]. The potential of the transformer architecture for Bayesian inference was demonstrated in [8] with respect to In-Context Learning (ICL). More recently, in [9], it was shown how to frame the basic BO algorithm [10,11] in natural language terms, enabling LLMs to sequentially propose promising solutions conditioned on previous trials and observations. The proposed approach, namely LLAMBO (Large Language Model for Bayesian Optimization), addresses two critical problems: enhancing, through LLMs, the key components of BO, including the surrogate model and the acquisition function, and expressing the modules of the BO pipeline in natural language terms. In the basic BO algorithm and most of its extensions, the Gaussian Process (GP) [12,13] is the most common choice for the probabilistic surrogate model, although other methods have been attracting increasing attention. It is well known that neural networks are universal approximators, whereas the GP has the key advantages of an analytical formula and a principled estimate of uncertainty. On the other hand, neural networks, especially Bayesian Neural Networks (BNNs), have also been considered due to their flexibility in handling high-dimensional optimization problems. Moreover, the availability of computing power has brought to the fore the use of Monte Carlo methods for estimating uncertainty. Finally, transformers can provide another surrogate model, with the advantage over the GP of naturally integrating contextual understanding, few-shot learning, and domain knowledge.
The relation between LLMs and BO is two-way: along with the role of LLMs in enhancing Bayesian optimization, another line is to exploit BO to improve prompt/instruction engineering to enable LLMs to solve a specific task. The focus in this paper is on the second objective, analyzing recent BO methods and proposing a new and less resource-consuming approach.

1.1. Organization of the Paper

Section 2 provides a concise overview of the main approaches to prompt/instruction optimization. Section 3 details the approach proposed in this paper, namely BOInG (Bayesian Optimization for Instruction Generation) (Figure 1). Section 4 provides the experimental settings and the results. Section 5 provides concluding remarks, along with the limitations and perspectives of BOInG and, more generally, of BO in working with LLMs.
Figure 1. High-level BOInG instruction generation workflow and components.

1.2. Contributions

From the methodological point of view, a new strategy is proposed to deal with the combinatorial nature of the problem. Instead of performing BO in a low-dimensional continuous space and then trying to recast the solution to the closest possible text, a penalty is included in the BO acquisition function to push the search for promising solutions towards low-dimensional representations of texts known to the LLMs.
As with most current approaches, BOInG works with two LLMs: the first is used as an instruction generator and the second as a solver for a specific task. While other state-of-the-art approaches (discussed in the related works subsection) require that at least one of the two LLMs is a white-box model, in BOInG, both LLMs can be black-box models, leading to a significant reduction in "on-premises" computational resource use. BOInG overcomes these limitations by leveraging two black-box LLMs, currently implemented with GPT-3.5 but extensible to more advanced closed-source models like GPT-4o. This is particularly advantageous, as the highest-performing models are often closed-source, while open-source alternatives typically offer lower quality and require larger architectures to achieve comparable results, further increasing computational demands and necessitating multi-GPU setups. BOInG accesses these models via an API in an "as-a-service" paradigm, only requiring the embeddings of a white-box LLM (specifically, GPT-2) for the penalty term applied to the acquisition function. By eliminating the need to run a full LLM locally, BOInG substantially reduces computational requirements while maintaining the optimization paradigm.
Figure 2 summarizes the costs of BOInG against those of two state-of-the-art methods, namely InstructZero [14] and Instinct [15], as described in the following.
Figure 2. Comparison of computational resources and costs across three BO-based instruction generation approaches. Information is related to the GPU type, parameter count, memory needed, FLOPS, hourly cost, and cloud provider for BOInG, InstructZero, and Instinct. It is evident that BOInG uses significantly fewer parameters and requires significantly less memory.

3. Bayesian Optimization for Instruction Generation (BOInG)

In this section, we detail our proposed approach, BOInG. First, we introduce some useful notations:
$M_g$: the LLM working as instruction generator. It receives a hard prompt and a small set of examples $(\bar{X}, \bar{Y}) \subset D$ as input and produces a text representing an instruction.
$M_s$: the LLM working as solver for a certain task. It receives an instruction and a large set of input examples, that is, $X$ from $(X, Y) \subset D$, as input and provides its own predictions ($\hat{Y} \approx Y$).
$w \in W^\tau$: a hard prompt consisting of $\tau$ tokens given as input to $M_g$, with $W$ a vocabulary, preferably the one on which $M_g$ has been trained.
$E = \{e_i\}_{i=1}^{|W|}$: the set of token embeddings such that $e_i = \mathrm{encode}(W_i)$ and $W_i = \mathrm{decode}(e_i)$, where $W_i$ denotes the $i$-th token in the vocabulary $W$.
$p \in P \subset \mathbb{R}^{d \times \tau}$: a soft prompt in a conveniently low-dimensional search space.
We consider the same workflow as in InstructZero [14] and Instinct [15]: $M_g$ works as an instruction generator; then, $M_s$ works as a solver for a certain task by using the instruction generated by $M_g$. The workflow starts by injecting a hard prompt $w \in W^\tau$ into $M_g$, along with a small set of examples $(\bar{X}, \bar{Y}) \subset D$. The generated instruction, denoted by $M_g(w \| (\bar{X}, \bar{Y}))$, is given as input to $M_s$, along with a large set of input examples $X$ whose associated outputs $Y$ must be predicted. The symbol $\|$ denotes the concatenation operator.
The final aim is to efficiently search for
$$ w^* \in \operatorname*{argmax}_{w \in W^\tau} \mathcal{L}(w; M_g, M_s, \bar{X}, \bar{Y}, X, Y) $$
with the loss function defined as follows:
$$ \mathcal{L}(w; M_g, M_s, \bar{X}, \bar{Y}, X, Y) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big[ y_i = M_s\big( M_g(w \| (\bar{X}, \bar{Y})) \| x_i \big) \big] $$
where $\mathbb{1}[a = b]$ denotes the indicator function (equal to 1 if and only if $a = b$ and 0 otherwise) and $M_s(M_g(w \| (\bar{X}, \bar{Y})) \| x_i)$ is the output provided by the second LLM, $M_s$, given the input $x_i$ and the generated instruction $M_g(w \| (\bar{X}, \bar{Y}))$. The entire optimization process is summarized in Figure 5.
Figure 5. Flow chart of the prompt optimization loop, starting with a hard prompt and a small example set. The process involves instruction generation via a closed-source LLM ($M_g$), loss computation from $M_s$'s output given the generated instruction and the training set, and Bayesian optimization. This cyclical workflow illustrates the generation of instructions in the BOInG algorithm.
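To make the evaluation concrete, the following is a minimal Python sketch of the loss above; the helper `ask` (a black-box API call to $M_s$) and the prompt template are illustrative assumptions, not the paper's exact implementation.

from typing import Callable, List, Tuple

def evaluate_instruction(
    instruction: str,
    examples: List[Tuple[str, str]],   # validation pairs (x_i, y_i)
    ask: Callable[[str], str],         # black-box call to the solver M_s
) -> float:
    """Average exact-match score of M_s over (X, Y) given an instruction."""
    hits = 0
    for x, y in examples:
        prediction = ask(f"{instruction}\n\nInput: {x}\nOutput:")
        hits += int(prediction.strip() == y.strip())
    return hits / len(examples)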
Solving the problem above is difficult due to the combinatorial nature of the search space $W^\tau$. As already addressed in the recent literature, our idea is to, instead, use a soft prompt $p \in P \subset \mathbb{R}^{d \times \tau}$, where $\tau$ is the number of tokens and $d$ is the dimensionality of the latent representation of each token. The most natural and suitable choice for the latent representation of the tokens would be $M_g$'s embedding space $\Omega \subset \mathbb{R}^q$, which, unfortunately, is usually high-dimensional, i.e., $q \gg d$. Accounting also for $\tau$, the final search space is $\Omega^\tau \subset \mathbb{R}^{q \times \tau}$; on the contrary, we want to identify a conveniently low-dimensional search space $P \subset \mathbb{R}^{d \times \tau}$ in which to perform BO.
As suggested in the recent literature, we use a random projection matrix $A \in \mathbb{R}^{q \times d}$, with entries sampled from a normal or uniform distribution, to project any soft prompt $p$ from our conveniently low-dimensional search space $P$ into an associated point in the high-dimensional search space $\Omega^\tau$. Thus, each $d$-dimensional token $p_i$ of the prompt $p$ is projected into an embedding $A p_i$. Random projection is a quite common procedure, known to be approximately distance-preserving in the sense of the Johnson–Lindenstrauss lemma. However, the projection of the soft prompt is still not a hard prompt, which is, instead, needed as the input of the instruction generator $M_g$.
It is important to remark that the embeddings of the tokens used to train $M_g$ all lie within the high-dimensional space $\Omega$ and are denoted as the set $E = \{e_i\}_{i=1}^{|W|}$, where $e_i = \mathrm{encode}(W_i)$ and $W_i = \mathrm{decode}(e_i)$. Thus, the most naïve strategy is to recast every projection $A p_i$ into the closest $e_j \in E$ and then retrieve the associated token $W_j$. This allows for the retrieval of a hard prompt $w \in W^\tau$, given the random projection of a soft prompt $p \in P$, and the computation of the associated loss. Although this is a possible procedure, we show later that it can be largely ineffective.
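A minimal sketch of this naïve recast, assuming PyTorch tensors for the soft prompt $p$ ($\tau \times d$), the projection matrix $A$ ($q \times d$), and the embedding matrix $E$ ($|W| \times q$), with `decode` mapping an embedding index back to its token:

import torch

def recast_to_hard_prompt(p: torch.Tensor,      # tau x d soft prompt
                          A: torch.Tensor,      # q x d random projection
                          E: torch.Tensor,      # |W| x q token embeddings
                          decode) -> list:
    """Map each projected token A @ p_i to its closest known embedding e_j
    and return the corresponding hard-prompt tokens W_j."""
    projections = p @ A.T                       # tau x q
    distances = torch.cdist(projections, E)     # tau x |W| pairwise distances
    nearest = distances.argmin(dim=1)           # index j_i for each token
    return [decode(int(j)) for j in nearest]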
At a generic iteration $n$ of BO, the trials $p^1, \ldots, p^n$ and the associated losses $\mathcal{L}^1, \ldots, \mathcal{L}^n$ are used to train a GP approximating the black-box and expensive loss function with respect to soft prompts $p$ in the low-dimensional space $P$. Selection of the next prompt to try, $p^{n+1}$, is performed by optimizing an acquisition function balancing exploration and exploitation. However, due to the random projection induced by $A$, we cannot be sure that the projected point $A p$ is "consistent" with the (embeddings of the) tokens known to $M_g$. In simpler terms, in a very high-dimensional space $\Omega$, the projected point $A p_i$ could be far away from the embedding of any token known to $M_g$, leading the LLM to generate incomprehensible and strange instructions that are difficult for $M_s$ to subsequently interpret.
Therefore, we introduce a penalty function $\pi(p)$ so that the UCB is optimized while keeping all the projections $A p_i$ associated with the soft prompt $p \in P$ as close as possible to the embeddings of the tokens known to $M_g$. Our penalty function is defined as follows:
$$ \pi(p) = \frac{1}{\tau} \sum_{i=1}^{\tau} \min_{e \in E} \| A p_i - e \| $$
In simpler terms, $\pi(p)$ is the average distance between each projection $A p_1, \ldots, A p_\tau$ and the closest embedding of a token known to $M_g$.
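The same penalty in code, consistent with the recast sketch above (again a PyTorch-based illustration):

import torch

def penalty(p: torch.Tensor, A: torch.Tensor, E: torch.Tensor) -> torch.Tensor:
    """pi(p) = (1/tau) * sum_i min_{e in E} ||A p_i - e||."""
    projections = p @ A.T                       # tau x q
    distances = torch.cdist(projections, E)     # tau x |W|
    return distances.min(dim=1).values.mean()   # average nearest-embedding distance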
Finally, $p^{n+1}$ is obtained as follows:
$$ (p^{n+1}, w^{n+1}) \in \operatorname*{argmax}_{p \in P} \left\{ \mu(p) + \beta\,\sigma(p) - C\,\pi(p) \right\} $$
where $C$ is a regularization hyperparameter and $\beta$ manages the exploration–exploitation trade-off. Indeed, the usual UCB is penalized by the quantity $C\,\pi(p)$.
It is important to remark that solving the penalized UCB requires the computation of $\pi(p)$; thus, we obtain $p^{n+1}$ along with the associated indices $j_i = \operatorname{argmin}_{j=1,\ldots,|W|} \| A p_i - e_j \|$ for each $i = 1, \ldots, \tau$, from which we directly obtain the associated hard prompt $w^{n+1}$ such that $w_i^{n+1} = \mathrm{decode}(e_{j_i}) = W_{j_i}$. This is crucial: we are searching in the conveniently low-dimensional search space $P$ but close to known embeddings, which is completely different from optimizing the UCB "freely" and then recasting $p^{n+1}$ to the closest embeddings a posteriori. In the second case, the UCB could lead to a $p^{n+1}$ so far away from all the known tokens (due to the high dimensionality of $\Omega^\tau$) that the recasting could be completely incoherent.
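Putting the pieces together, a sketch of the penalized acquisition; here `mu` and `sigma` stand in for the GP posterior mean and standard deviation over $P$ (hypothetical callables; in practice, they would come from, e.g., a GPyTorch model), and `penalty` is the function defined above.

def penalized_ucb(p, mu, sigma, A, E, beta=1.0, C=1.0):
    """Score a candidate soft prompt p; higher is better."""
    return mu(p) + beta * sigma(p) - C * penalty(p, A, E)

def next_prompt(candidates, mu, sigma, A, E, beta=1.0, C=1.0):
    """Pick, among candidate soft prompts, the maximizer of the penalized UCB."""
    scores = [float(penalized_ucb(p, mu, sigma, A, E, beta, C)) for p in candidates]
    return candidates[max(range(len(scores)), key=scores.__getitem__)]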
Finally, knowing the embeddings of the tokens known to $M_g$ would be the best option for the proposed approach; however, this requires $M_g$ to be a white-box model. To relax this requirement and allow the use of powerful black-box LLMs instead, we decided to adopt the embeddings of another white-box LLM (specifically, GPT-2 in our case), under the assumption that the distances between embeddings are coherent across different LLMs.
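For illustration, the GPT-2 embedding matrix used for the penalty can be obtained as follows; a minimal sketch assuming the Hugging Face transformers library is available:

import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

# E: the |W| x q matrix of token embeddings (q = 768 for GPT-2 small).
E = model.wte.weight.detach()               # shape: (50257, 768)

def decode(j: int) -> str:
    """Recover the token string W_j from an embedding index."""
    return tokenizer.decode([j])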
GPT-3.5 Turbo is used in BOInG as the LLM (for both $M_g$ and $M_s$). For each task, we use the following parameter settings: 5 and 20 samples from the training and validation sets, respectively; the number of tokens in every soft prompt is $\tau = 5$; the entries of the random projection matrix are sampled from a uniform distribution in $[-1, 1]$; the value of $d$ is set to 10; and 25 soft prompts are explored at each iteration. Finally, $\beta = 1$ and $C = 1$.
We utilized an evolutionary search algorithm, namely “SampleReducingMCAcquisitionFunction”, to find the top 25 soft prompts. All training and testing were performed on a 4-core machine equipped with an NVIDIA T4 GPU, which accelerated the matrix calculations necessary for determining the penalty (obtaining the embedding distances). The BOInG algorithm is summarized in the following (Algorithm 2).
Algorithm 2: BOInG (Bayesian Optimization for Instruction Generation)

Input: examples $(\bar{X}, \bar{Y})$, validation set $(X, Y)$, instruction generator LLM $M_g$, solver LLM $M_s$, maximal steps $T$, dimensionality $d$ of the search space, number of tokens $\tau$, vocabulary $W$, embedding set $E = \{e_i\}_{i=1}^{|W|}$, random projection matrix $A \in \mathbb{R}^{q \times d}$.

Initialize:
    $P \leftarrow \mathrm{SobolInitialization}(n = 10)$
    $V \leftarrow \emptyset$; $L \leftarrow \emptyset$
    for $p \in P$ do
        for $i = 1$ to $\tau$ do
            $j_i \leftarrow \operatorname{argmin}_{j=1,\ldots,|W|} \| A p_i - e_j \|$
            $w_i \leftarrow \mathrm{decode}(e_{j_i}) = W_{j_i}$
        end for
        $v \leftarrow M_g(w \| (\bar{X}, \bar{Y}))$
        $V \leftarrow V \cup \{v\}$
        $L \leftarrow L \cup \{ \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[y_i = M_s(v \| x_i)] \}$
    end for

while $|L| \le T$ do
    train a GP model using $(P, L)$
    $(p, w) \leftarrow \operatorname{argmax}_{p \in P} \{ \mu(p) + \beta \sigma(p) - C \pi(p) \}$ //penalized UCB
    $v \leftarrow M_g(w \| (\bar{X}, \bar{Y}))$ //generate instruction using the instruction generator
    $L \leftarrow L \cup \{ \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[y_i = M_s(v \| x_i)] \}$ //evaluate instruction
    $P \leftarrow P \cup \{p\}$
    $V \leftarrow V \cup \{v\}$
end while

Output:
    the best instruction $V_{i^*}$ with $i^* \in \operatorname{argmax}_{i=1,\ldots,|L|} L_i$

Where:
    $\pi(p) = \frac{1}{\tau} \sum_{i=1}^{\tau} \min_{e \in E} \| A p_i - e \|$ //penalty function
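The following is an end-to-end Python sketch of Algorithm 2, using a scikit-learn GP surrogate and simple random-candidate maximization of the penalized UCB (the paper uses a Sobol design and an evolutionary acquisition optimizer instead); `ask_generator` and `ask_solver`, the API calls to $M_g$ and $M_s$, as well as the prompt handling, are illustrative assumptions, and $E$ is taken here as a NumPy array.

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.gaussian_process import GaussianProcessRegressor

def boing(ask_generator, ask_solver, examples, val_set, E, decode,
          d=10, tau=5, T=30, n_init=10, n_cand=25, beta=1.0, C=1.0):
    q = E.shape[1]
    A = np.random.uniform(-1.0, 1.0, size=(q, d))      # random projection

    def to_instruction(p):
        proj = p @ A.T                                  # tau x q projections
        idx = cdist(proj, E).argmin(axis=1)             # nearest embeddings
        w = "".join(decode(int(j)) for j in idx)        # hard prompt
        return ask_generator(w, examples)               # instruction v

    def pi(p):                                          # penalty pi(p)
        return cdist(p @ A.T, E).min(axis=1).mean()

    def score_fn(v):                                    # 0/1 loss over (X, Y)
        return float(np.mean([ask_solver(v, x) == y for x, y in val_set]))

    # Initialization (the paper uses a Sobol design here).
    P, V, L = [], [], []
    for _ in range(n_init):
        p = np.random.uniform(-1.0, 1.0, size=(tau, d))
        v = to_instruction(p)
        P.append(p); V.append(v); L.append(score_fn(v))

    # BO loop with penalized UCB.
    while len(L) <= T:
        gp = GaussianProcessRegressor(normalize_y=True)
        gp.fit(np.stack([p.ravel() for p in P]), np.array(L))
        candidates = [np.random.uniform(-1.0, 1.0, size=(tau, d))
                      for _ in range(n_cand)]
        def ucb(p):
            mu, sd = gp.predict(p.ravel()[None, :], return_std=True)
            return mu[0] + beta * sd[0] - C * pi(p)
        p = max(candidates, key=ucb)
        v = to_instruction(p)
        P.append(p); V.append(v); L.append(score_fn(v))

    return V[int(np.argmax(L))]                         # best instruction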

4. Computational Results

4.1. Evaluation Metrics

Different score functions were considered depending on the specific task. In detail, we used the following four evaluation metrics:
Exact Match (EM): a correct response yields $EM = 1$; otherwise, $EM = 0$. The associated tasks are "Letters list" and "Second word letter".
Exact Set (ES): when evaluating each question-and-answer pair, $ES = 1$ if the predicted response precisely matches the set of correct responses and $ES = 0$ otherwise. The associated task is "Taxonomy".
Contain: this measure has a value of $Contain = 1$ if the predicted answer is contained in the set of correct responses and $Contain = 0$ otherwise. The associated tasks are "Word Sorting" and "Synonyms".
F1: the usual F1 score, calculated by comparing individual words in the predicted response to those in the actual (true) answer. The words common to the predicted and actual answers form the basis for the F1 score: precision is the proportion of common words to the total words in the predicted response, while recall is the proportion of common words to the total words in the actual answer. The associated task is "Common". (A sketch of these metrics follows the list.)
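A minimal Python sketch of the four metrics, assuming simple whitespace tokenization (the exact normalization used in the benchmark may differ):

def exact_match(pred: str, gold: str) -> int:
    """EM: 1 if the prediction equals the reference exactly."""
    return int(pred.strip() == gold.strip())

def exact_set(pred: str, golds: set) -> int:
    """ES: 1 if the predicted items match the reference set exactly."""
    return int(set(pred.split()) == golds)

def contain(pred: str, golds: set) -> int:
    """Contain: 1 if the predicted answer is among the correct responses."""
    return int(pred.strip() in golds)

def f1(pred: str, gold: str) -> float:
    """Word-level F1 between prediction and reference."""
    p_words, g_words = pred.split(), gold.split()
    common = set(p_words) & set(g_words)
    if not common:
        return 0.0
    precision = len(common) / len(p_words)
    recall = len(common) / len(g_words)
    return 2 * precision * recall / (precision + recall)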

4.2. Selected Tasks

The tasks were selected from [14]; due to budgetary constraints, we selected a subset of 6 of the original 24 tasks to reduce the required testing and implementation time. The selected tasks were chosen to be representative of different levels of difficulty (based on the scores reported for APE, Uniform, and InstructZero).
Easy tasks:
Letter list: Provide the input string, separating each character with a space.
Word sorting: With a list as input, the task is to produce an output that is the list in alphabetical order.
Challenging tasks:
Taxonomy: Create a program that generates a list of animals based on the provided input.
Synonyms: Provide the synonyms of the input word.
Second word letter: Taking a string as input, return the second character.
Common: Extract the relationship between the input words.
Table 1 summarizes the results obtained by BOInG on the selected tasks and compares them against three other methods. BOInG’s test score is in line with other BO-based approaches (i.e., Instinct and InstructZero) and outperforms them on two tasks.
Table 1. Test scores comparing BOInG against APE, Instinct, and InstructZero (* for APE and Instinct results from [15]; ** for InstructZero results from [14]).
An important consideration is the significantly different costs entailed by the three methods. BOInG requires the invocation of $M_g$ via an API, implying a minimal additional cost of approximately USD 0.01 per 20,270 tokens (using GPT-3.5-Turbo-0125, OpenAI, San Francisco, CA, USA). This cost is significantly lower than that required to run white-box LLMs. For instance, the GPU time required for InstructZero (using an NVIDIA A6000, NVIDIA, Santa Clara, CA, USA) would cost approximately USD 0.01 for just 25 s, during which only a few local LLM calls could be executed.
Finally, we provide some examples of generated instructions and associated hard prompts over the BO process. In Table 2, the "hard prompt" column displays the hard prompt prepended to the input given to the instruction generator LLM ($M_g$), consisting of seemingly random character strings that represent the algorithm's exploration of the prompt space. The "instruction" column shows the human-readable, task-relevant instructions generated by $M_g$ based on these hard prompts and task examples. The "score (train)" column indicates the performance of each instruction, ranging from 0 to 1, with higher scores denoting better performance. The table captures the optimization process across six rows, including initialization and subsequent iterations. Initially, two different hard prompts yield similar instructions with identical scores of 0.40. The third iteration shows a marginal improvement, achieving a score of 0.45. The fourth row demonstrates the best performance, reaching a score of 0.55 with the instruction "Provide synonymous terms for the given words". This table showcases the algorithm's progression towards increasingly effective instructions for the synonyms task.
Table 2. The presented table illustrates the iterative process of instruction optimization for a synonyms task (no further improvement after the 3rd BO iteration).

5. Conclusions, Limitations, and Perspectives

BOInG demonstrates the versatility and efficiency of Bayesian Optimization (BO) for black-box problems in tackling the highly structured combinatorial challenges of prompt tuning and instruction generation in LLMs.
By leveraging an "as-a-service" API paradigm and treating both models as black boxes, BOInG avoids the need for access to the instruction generator's architecture and weights, offering greater flexibility. This approach enables BOInG to utilize high-performing, closed-source models like GPT-4o or Claude 3.5 Sonnet without requiring significant local computational resources. Unlike methods relying on open-source LLMs that necessitate dedicated hardware such as GPUs, BOInG only requires penalty computations using the GPT-2 embedding layer, which is orders of magnitude smaller than the LLMs used in other approaches. Experimental results confirm BOInG's competitive performance compared to state-of-the-art methods like InstructZero and Instinct, with substantially lower resource requirements.
While effective, BOInG relies on embeddings from GPT-2 for penalty function calculations, which may introduce coherence challenges due to the mismatch between its embedding layer and that of the primary LLM. Addressing this limitation through experimentation with alternative embeddings or model-specific penalty functions could improve compatibility and performance.
Despite the successes of BO, its limitations include challenges in handling categorical variables and problems with many variables or with images, settings in which Gaussian Processes (GPs) tend to underperform. For such cases, Bayesian neural networks have been suggested as promising alternatives, offering the ability to flexibly represent non-stationary behavior and handle multi-output objectives. Replacing GPs with Bayesian neural networks could enhance the robustness of BOInG in these scenarios.
An entirely different perspective is to generalize the capabilities of LLMs beyond natural language tasks, generating candidate solutions for the BO process with limited data in contexts requiring generalization from few examples.

Author Contributions

Conceptualization, F.A., I.G. and A.C.; Methodology, F.A., A.P. and A.C.; Software, A.S. and A.P.; Validation, A.P.; Investigation, I.G.; Resources, I.G.; Data curation, A.S.; Writing—original draft, F.A. and A.C.; Writing—review & editing, A.C.; Visualization, A.S.; Supervision, I.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in public repositories from previous papers reporting the associated URLs: InstructZero (https://github.com/Lichang-Chen/InstructZero, accessed on 6 December 2024); Instinct (https://github.com/xqlin98/INSTINCT, accessed on 6 December 2024).

Conflicts of Interest

Authors Andrea Ponti and Ilaria Giordani were employed by the company OAKS s.r.l. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Song, X.; Tian, Y.; Lange, R.T.; Lee, C.; Tang, Y.; Chen, Y. Position Paper: Leveraging Foundational Models for Black-Box Optimization: Benefits, challenges, and Future Directions. arXiv 2024, arXiv:2405.03547. [Google Scholar]
  2. Liu, S.; Chen, C.; Qu, X.; Tang, K.; Ong, Y.S. Large language models as evolutionary optimizers. arXiv 2023, arXiv:2310.19046. [Google Scholar]
  3. Song, X.; Li, O.; Lee, C.; Peng, D.; Perel, S.; Chen, Y. Omnipred: Language models as universal regressors. arXiv 2024, arXiv:2402.14547. [Google Scholar]
  4. Lange, R.T.; Tian, Y.; Tang, Y. Evolution Transformer: In-Context Evolutionary Optimization. arXiv 2024, arXiv:2403.02985. [Google Scholar]
  5. Liu, F.; Lin, X.; Wang, Z.; Yao, S.; Tong, X.; Yuan, M.; Zhang, Q. Large language model for multi-objective evolutionary optimization. arXiv 2023, arXiv:2310.12541. [Google Scholar]
  6. Liu, F.; Tong, X.; Yuan, M.; Zhang, Q. Algorithm evolution using large language model. arXiv 2023, arXiv:2311.15249. [Google Scholar]
  7. Li, J.; Song, F.; Jin, Y.; Qiang, W.; Zheng, C.; Sun, F.; Xiong, H. BayesPrompt: Prompting Large-Scale Pre-Trained Language Models on Few-shot Inference via Debiased Domain Abstraction. arXiv 2024, arXiv:2401.14166. [Google Scholar]
  8. Müller, S.; Hollmann, N.; Arango, S.P.; Grabocka, J.; Hutter, F. Transformers can do bayesian inference. arXiv 2021, arXiv:2112.10510. [Google Scholar]
  9. Liu, T.; Astorga, N.; Seedat, N.; van der Schaar, M. Large language models to enhance bayesian optimization. arXiv 2024, arXiv:2402.03921. [Google Scholar]
  10. Archetti, F.; Candelieri, A. Bayesian Optimization and Data Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2019; Volume 849. [Google Scholar]
  11. Garnett, R. Bayesian Optimization; Cambridge University Press: Cambridge, UK, 2023. [Google Scholar]
  12. Williams, C.K.; Rasmussen, C.E. Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, USA, 2006; Volume 2, p. 4. [Google Scholar]
  13. Gramacy, R.B. Surrogates: Gaussian Process Modeling, Design, and Optimization for the Applied Sciences; Chapman and Hall/CRC: Boca Raton, FL, USA, 2020. [Google Scholar]
  14. Chen, L.; Chen, J.; Goldstein, T.; Huang, H.; Zhou, T. InstructZero: Efficient instruction optimization for black-box large language models. arXiv 2023, arXiv:2306.03082. [Google Scholar]
  15. Lin, X.; Wu, Z.; Dai, Z.; Hu, W.; Shu, Y.; Ng, S.K.; Jaillet, P.; Low, B.K.H. Use your instinct: Instruction optimization using neural bandits coupled with transformers. arXiv 2023, arXiv:2310.02905. [Google Scholar]
  16. Sabbatella, A.; Ponti, A.; Giordani, I.; Candelieri, A.; Archetti, F. Prompt Optimization in Large Language Models. Mathematics 2024, 12, 929. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
