Applied Sciences
  • Article
  • Open Access

19 December 2024

Bayesian Optimization for Instruction Generation

1 Department of Computer Science Systems and Communication, University of Milano-Bicocca, 20126 Milano, Italy
2 Department of Economics Management and Statistics, University of Milano-Bicocca, 20126 Milano, Italy
3 OAKS s.r.l., 20125 Milan, Italy
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Advances in Large Language Models: Techniques, Applications and Challenges

Abstract

The performance of Large Language Models (LLMs) strongly depends on the selection of the best instructions for different downstream tasks, especially in the case of black-box LLMs. This study introduces BOInG (Bayesian Optimization for Instruction Generation), a method leveraging Bayesian Optimization (BO) to efficiently generate instructions while addressing the combinatorial nature of instruction search. Over the last decade, BO has emerged as a highly effective optimization method in various domains due to its flexibility and sample efficiency. At its core, BOInG employs Bayesian search in a low-dimensional continuous space, projecting solutions into a high-dimensional token embedding space to retrieve discrete tokens. These tokens act as seeds for the generation of human-readable, task-relevant instructions. Experimental results demonstrate that BOInG achieves comparable or superior performance to state-of-the-art methods, such as InstructZero and Instinct, with substantially lower resource requirements while also enabling the use of both white-box and black-box models. This approach offers both theoretical and practical benefits without requiring specialized hardware.

1. Introduction

Large Language Models (LLMs) have triggered a remarkable amount of innovation across the machine learning arena. In this paper, we focus on the design of experiments and black-box optimization, areas that have not been extensively explored to date despite the many opportunities presented by the integration of LLMs and optimization. Foundational language models can be game changers in optimization: by leveraging the enormous amount of information available in free-form text, they enable an entirely new approach to optimization task comprehension, exploiting wider contexts across new tasks and generalizing pre-trained models over unseen search spaces.
The considerations outlined in the Introduction were inspired by [1], which argues the potential of foundational models for enhancing black-box optimization and advocates for the adoption of transformers and LLMs to achieve this goal.
A general approach is proposed in [2] that introduces LLMs as optimizers, describing the optimization task through natural language. In each optimization step, the LLM generates new solutions from the prompts that contain the previously generated solutions with their values. The first problems considered are linear regression and the travelling salesman problem, with prompt optimization aimed at finding instructions that maximize the task-specific objective function. In [3], it is demonstrated that, using textual representations of mathematical values, LLMs can act as universal regressors. The proposed method, namely OmniPred, can take as input dynamically varying input spaces and does not require normalization. LLMs can also be used in the framework of evolutionary optimization [4] for single-objective and multi-objective evolutionary optimization [5,6].
Another computational framework for exploiting LLMs in black-box optimization is Bayesian Optimization (BO) [7]. The potential of the transformer architecture for Bayesian inference was demonstrated in [8] with respect to In-Context Learning (ICL). More recently, in [9], it was shown how to frame the basic BO algorithm [10,11] in natural language terms, enabling LLMs to sequentially propose promising solutions conditioned on previous trials and observations. The proposed approach, namely LLAMBO (Large Language Model for Bayesian Optimization), addresses two critical problems: enhancing, through LLMs, the key components of BO, including the surrogate model and the acquisition function, and expressing the modules of the BO pipeline in natural language terms. In the basic BO algorithm and most of its extensions, the Gaussian Process (GP) [12,13] is the most common choice for the probabilistic surrogate model, although other methods have been attracting increasing attention. It is well known that neural networks are universal approximators, whereas the GP has the key advantages of an analytical formula and a principled estimate of uncertainty. On the other hand, neural networks, especially Bayesian Neural Networks (BNNs), have also been considered due to their flexibility in handling high-dimensional optimization problems. Moreover, the availability of computing power has brought to the fore the use of Monte Carlo methods for estimating uncertainty. Finally, transformers can provide another surrogate model, with the advantage over the GP of naturally integrating contextual understanding, few-shot learning, and domain knowledge.
The relation between LLMs and BO is two-way: along with the role of LLMs in enhancing Bayesian optimization, another line is to exploit BO to improve prompt/instruction engineering to enable LLMs to solve a specific task. The focus in this paper is on the second objective, analyzing recent BO methods and proposing a new and less resource-consuming approach.

1.1. Organization of the Paper

Section 2 provides a concise overview of the main approaches to prompt/instruction optimization. Section 3 details the approach proposed in this paper, namely BOInG (Bayesian Optimization for Instruction Generation) (Figure 1). Section 4 provides the experimental settings and the results. Section 5 provides concluding remarks, along with the limitations and perspectives of BOInG and, more generally, of BO in working with LLMs.
Figure 1. High-level BOInG instruction generation workflow and components.

1.2. Contributions

From the methodological point of view, a new strategy is proposed to deal with the combinatorial nature of the problem. Instead of performing BO in a low-dimensional continuous space and then trying to recast the solution to the closest possible text, a penalty is included in the BO acquisition function to push the search for promising solutions towards low-dimensional representations of texts known to the LLMs.
As with most current approaches, BOInG works with two LLMs: the first is used as an instruction generator and the second as a solver for a specific task. While other state-of-the-art approaches (discussed in the related works subsection) require that at least one of the two LLMs is a white-box model, in BOInG, both LLMs can be black-box models, leading to a significant reduction in "on-premises" computational resource use. BOInG overcomes these limitations by leveraging two black-box LLMs, currently implemented with GPT-3.5 but extensible to more advanced closed-source models like GPT-4o. This is particularly advantageous, as the highest-performing models are often closed-source, while open-source alternatives typically offer lower quality and require larger architectures to achieve comparable results, further increasing computational demands and necessitating multi-GPU setups. BOInG accesses these models via an API in an "as-a-service" paradigm, only requiring the embeddings of a white-box LLM (specifically, GPT-2) for the penalty term applied to the acquisition function. By eliminating the need to run a full LLM locally, BOInG substantially reduces computational requirements while maintaining the optimization paradigm.
Figure 2 summarizes the costs of BOInG against those of two state-of-the-art methods, namely InstructZero [14] and Instinct [15], as described in the following.
Figure 2. Comparison of computational resources and costs across three BO-based instruction generation approaches. Information is related to the GPU type, parameter count, memory needed, FLOPS, hourly cost, and cloud provider for BOInG, InstructZero, and Instinct. It is evident that BOInG uses significantly fewer parameters and requires significantly less memory.

3. Bayesian Optimization for Instruction Generation (BOInG)

In this section, we detail our proposed approach, BOInG. First, we introduce some useful notations:
$M_g$: the LLM working as instruction generator. It receives a hard prompt and a small set of examples $(\bar{X}, \bar{Y}) \subset D$ as input and produces a text representing an instruction.
$M_s$: the LLM working as solver for a certain task. It receives an instruction and a large set of input examples, that is, $X$ from $(X, Y) \subset D$, as input and provides its own predictions ($\hat{Y} \approx Y$).
$w \in W^\tau$: a hard prompt consisting of $\tau$ tokens given as input to $M_g$, with $W$ a vocabulary, preferably the one on which $M_g$ has been trained.
$E = \{e_i\}_{i=1}^{|W|}$: the set of token embeddings such that $e_i = \mathrm{encode}(W_i)$ and $W_i = \mathrm{decode}(e_i)$, where $W_i$ denotes the $i$-th token in the vocabulary $W$.
$p \in P \subset \mathbb{R}^{d \times \tau}$: a soft prompt in a conveniently low-dimensional search space.
We consider the same workflow as in InstructZero [14] and Instinct [15]: $M_g$ works as an instruction generator; then, $M_s$ works as a solver for a certain task by using the instruction generated by $M_g$. The workflow starts by injecting a hard prompt $w \in W^\tau$ into $M_g$, along with a small set of examples $(\bar{X}, \bar{Y}) \subset D$. The generated instruction, denoted by $M_g(w \| (\bar{X}, \bar{Y}))$, is given as input to $M_s$, along with a large set of input examples $X$ whose associated outputs $Y$ must be predicted. The symbol $\|$ denotes the concatenation operator.
The final aim is to efficiently search for
$$ w^* \in \operatorname*{argmax}_{w \in W^\tau} \mathcal{L}(w; M_g, M_s, \bar{X}, \bar{Y}, X, Y) $$
with the loss function defined as follows:
$$ \mathcal{L}(w; M_g, M_s, \bar{X}, \bar{Y}, X, Y) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big[ y_i = M_s\big( M_g(w \| (\bar{X}, \bar{Y})) \| x_i \big) \big] $$
where $\mathbb{1}[a = b]$ denotes the indicator function (equal to 1 if and only if $a = b$ and 0 otherwise) and $M_s(M_g(w \| (\bar{X}, \bar{Y})) \| x_i)$ is the output provided by the second LLM, $M_s$, given the input $x_i$ and the generated instruction $M_g(w \| (\bar{X}, \bar{Y}))$. The entire optimization process is summarized in Figure 5.
Figure 5. Flow chart of the prompt optimization loop, starting with a hard prompt and a small example set. The process involves instruction generation via a closed-source LLM ($M_g$), loss computation from $M_s$'s output given the generated instruction and the training set, and Bayesian optimization. This cyclical workflow illustrates the generation of instructions in the BOInG algorithm.
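To make the evaluation concrete, the following is a minimal Python sketch of the loss above; the helper `ask` (a black-box API call to $M_s$) and the prompt template are illustrative assumptions, not the paper's exact implementation.

from typing import Callable, List, Tuple

def evaluate_instruction(
    instruction: str,
    examples: List[Tuple[str, str]],   # validation pairs (x_i, y_i)
    ask: Callable[[str], str],         # black-box call to the solver M_s
) -> float:
    """Average exact-match score of M_s over (X, Y) given an instruction."""
    hits = 0
    for x, y in examples:
        prediction = ask(f"{instruction}\n\nInput: {x}\nOutput:")
        hits += int(prediction.strip() == y.strip())
    return hits / len(examples)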
Solving the problem above is difficult due to the combinatorial nature of the search space $W^\tau$. As already addressed in the recent literature, our idea is to, instead, use a soft prompt $p \in P \subset \mathbb{R}^{d \times \tau}$, where $\tau$ is the number of tokens and $d$ is the dimensionality of the latent representation of each token. The most natural and suitable choice for the latent representation of the tokens would be $M_g$'s embedding space $\Omega \subset \mathbb{R}^q$, which, unfortunately, is usually high-dimensional, i.e., $q \gg d$. Accounting also for $\tau$, the final search space is $\Omega^\tau \subset \mathbb{R}^{q \times \tau}$; on the contrary, we want to identify a conveniently low-dimensional search space $P \subset \mathbb{R}^{d \times \tau}$ in which to perform BO.
As suggested in the recent literature, we use a random projection matrix $A \in \mathbb{R}^{q \times d}$, with entries sampled from a normal or uniform distribution, to project any soft prompt $p$ from our conveniently low-dimensional search space $P$ into an associated point in the high-dimensional search space $\Omega^\tau$. Thus, each $d$-dimensional token $p_i$ of the prompt $p$ is projected into an embedding $A p_i$. Random projection is a quite common procedure, known to be approximately distance-preserving in the sense of the Johnson–Lindenstrauss lemma. However, the projection of the soft prompt is still not a hard prompt, which is, instead, needed as the input of the instruction generator $M_g$.
It is important to remark that the embeddings of the tokens used to train $M_g$ all lie within the high-dimensional space $\Omega$ and are denoted as the set $E = \{e_i\}_{i=1}^{|W|}$, where $e_i = \mathrm{encode}(W_i)$ and $W_i = \mathrm{decode}(e_i)$. Thus, the most naïve strategy is to recast every projection $A p_i$ into the closest $e_j \in E$ and then retrieve the associated token $W_j$. This allows for the retrieval of a hard prompt $w \in W^\tau$, given the random projection of a soft prompt $p \in P$, and the computation of the associated loss. Although this is a possible procedure, we show later that it can be largely ineffective.
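A minimal sketch of this naïve recast, assuming PyTorch tensors for the soft prompt $p$ ($\tau \times d$), the projection matrix $A$ ($q \times d$), and the embedding matrix $E$ ($|W| \times q$), with `decode` mapping an embedding index back to its token:

import torch

def recast_to_hard_prompt(p: torch.Tensor,      # tau x d soft prompt
                          A: torch.Tensor,      # q x d random projection
                          E: torch.Tensor,      # |W| x q token embeddings
                          decode) -> list:
    """Map each projected token A @ p_i to its closest known embedding e_j
    and return the corresponding hard-prompt tokens W_j."""
    projections = p @ A.T                       # tau x q
    distances = torch.cdist(projections, E)     # tau x |W| pairwise distances
    nearest = distances.argmin(dim=1)           # index j_i for each token
    return [decode(int(j)) for j in nearest]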
At a generic iteration $n$ of BO, the trials $p^1, \ldots, p^n$ and the associated losses $\mathcal{L}^1, \ldots, \mathcal{L}^n$ are used to train a GP approximating the black-box and expensive loss function with respect to soft prompts $p$ in the low-dimensional space $P$. Selection of the next prompt to try, $p^{n+1}$, is performed by optimizing an acquisition function balancing exploration and exploitation. However, due to the random projection induced by $A$, we cannot be sure that the projected point $A p$ is "consistent" with the (embeddings of the) tokens known to $M_g$. In simpler terms, in a very high-dimensional space $\Omega$, the projected point $A p_i$ could be far away from the embedding of any token known to $M_g$, leading the LLM to generate incomprehensible and strange instructions that are difficult for $M_s$ to subsequently interpret.
Therefore, we introduce a penalty function $\pi(p)$ so that the UCB is optimized while keeping all the projections $A p_i$ associated with the soft prompt $p \in P$ as close as possible to the embeddings of the tokens known to $M_g$. Our penalty function is defined as follows:
$$ \pi(p) = \frac{1}{\tau} \sum_{i=1}^{\tau} \min_{e \in E} \| A p_i - e \| $$
In simpler terms, $\pi(p)$ is the average distance between each projection $A p_1, \ldots, A p_\tau$ and the closest embedding of a token known to $M_g$.
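The same penalty in code, consistent with the recast sketch above (again a PyTorch-based illustration):

import torch

def penalty(p: torch.Tensor, A: torch.Tensor, E: torch.Tensor) -> torch.Tensor:
    """pi(p) = (1/tau) * sum_i min_{e in E} ||A p_i - e||."""
    projections = p @ A.T                       # tau x q
    distances = torch.cdist(projections, E)     # tau x |W|
    return distances.min(dim=1).values.mean()   # average nearest-embedding distance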
Finally, $p^{n+1}$ is obtained as follows:
$$ (p^{n+1}, w^{n+1}) \in \operatorname*{argmax}_{p \in P} \left\{ \mu(p) + \beta\,\sigma(p) - C\,\pi(p) \right\} $$
where $C$ is a regularization hyperparameter and $\beta$ manages the exploration–exploitation trade-off. Indeed, the usual UCB is penalized by the quantity $C\,\pi(p)$.
It is important to remark that solving the penalized UCB requires the computation of $\pi(p)$; thus, we obtain $p^{n+1}$ along with the associated indices $j_i = \operatorname{argmin}_{j=1,\ldots,|W|} \| A p_i - e_j \|$ for each $i = 1, \ldots, \tau$, from which we directly obtain the associated hard prompt $w^{n+1}$ such that $w_i^{n+1} = \mathrm{decode}(e_{j_i}) = W_{j_i}$. This is crucial: we are searching in the conveniently low-dimensional search space $P$ but close to known embeddings, which is completely different from optimizing the UCB "freely" and then recasting $p^{n+1}$ to the closest embeddings a posteriori. In the second case, the UCB could lead to a $p^{n+1}$ so far away from all the known tokens (due to the high dimensionality of $\Omega^\tau$) that the recasting could be completely incoherent.
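Putting the pieces together, a sketch of the penalized acquisition; here `mu` and `sigma` stand in for the GP posterior mean and standard deviation over $P$ (hypothetical callables; in practice, they would come from, e.g., a GPyTorch model), and `penalty` is the function defined above.

def penalized_ucb(p, mu, sigma, A, E, beta=1.0, C=1.0):
    """Score a candidate soft prompt p; higher is better."""
    return mu(p) + beta * sigma(p) - C * penalty(p, A, E)

def next_prompt(candidates, mu, sigma, A, E, beta=1.0, C=1.0):
    """Pick, among candidate soft prompts, the maximizer of the penalized UCB."""
    scores = [float(penalized_ucb(p, mu, sigma, A, E, beta, C)) for p in candidates]
    return candidates[max(range(len(scores)), key=scores.__getitem__)]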
Finally, knowing the embeddings of the tokens known to $M_g$ would be the best option for the proposed approach; however, this requires $M_g$ to be a white-box model. To relax this requirement and allow the use of powerful black-box LLMs instead, we decided to adopt the embeddings of another white-box LLM (specifically, GPT-2 in our case), under the assumption that the distances between embeddings are coherent across different LLMs.
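For illustration, the GPT-2 embedding matrix used for the penalty can be obtained as follows; a minimal sketch assuming the Hugging Face transformers library is available:

import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

# E: the |W| x q matrix of token embeddings (q = 768 for GPT-2 small).
E = model.wte.weight.detach()               # shape: (50257, 768)

def decode(j: int) -> str:
    """Recover the token string W_j from an embedding index."""
    return tokenizer.decode([j])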
GPT-3.5 Turbo is used in BOInG as the LLM (for both $M_g$ and $M_s$). For each task, we use the following parameter settings: 5 and 20 samples from the training and validation sets, respectively; the number of tokens in every soft prompt is $\tau = 5$; the entries of the random projection matrix are sampled from a uniform distribution in $[-1, 1]$; the value of $d$ is set to 10; and 25 soft prompts are explored at each iteration. Finally, $\beta = 1$ and $C = 1$.
We utilized an evolutionary search algorithm, namely “SampleReducingMCAcquisitionFunction”, to find the top 25 soft prompts. All training and testing were performed on a 4-core machine equipped with an NVIDIA T4 GPU, which accelerated the matrix calculations necessary for determining the penalty (obtaining the embedding distances). The BOInG algorithm is summarized in the following (Algorithm 2).
Algorithm 2: BOInG (Bayesian Optimization for Instruction Generation)

Input: examples $(\bar{X}, \bar{Y})$, validation set $(X, Y)$, instruction generator LLM $M_g$, solver LLM $M_s$, maximal steps $T$, dimensionality $d$ of the search space, number of tokens $\tau$, vocabulary $W$, embedding set $E = \{e_i\}_{i=1}^{|W|}$, random projection matrix $A \in \mathbb{R}^{q \times d}$.

Initialize:
    $P \leftarrow \mathrm{SobolInitialization}(n = 10)$
    $V \leftarrow \emptyset$; $L \leftarrow \emptyset$
    for $p \in P$ do
        for $i = 1$ to $\tau$ do
            $j_i \leftarrow \operatorname{argmin}_{j=1,\ldots,|W|} \| A p_i - e_j \|$
            $w_i \leftarrow \mathrm{decode}(e_{j_i}) = W_{j_i}$
        end for
        $v \leftarrow M_g(w \| (\bar{X}, \bar{Y}))$
        $V \leftarrow V \cup \{v\}$
        $L \leftarrow L \cup \{ \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[y_i = M_s(v \| x_i)] \}$
    end for

while $|L| \le T$ do
    train a GP model using $(P, L)$
    $(p, w) \leftarrow \operatorname{argmax}_{p \in P} \{ \mu(p) + \beta \sigma(p) - C \pi(p) \}$ //penalized UCB
    $v \leftarrow M_g(w \| (\bar{X}, \bar{Y}))$ //generate instruction using the instruction generator
    $L \leftarrow L \cup \{ \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[y_i = M_s(v \| x_i)] \}$ //evaluate instruction
    $P \leftarrow P \cup \{p\}$
    $V \leftarrow V \cup \{v\}$
end while

Output:
    the best instruction $V_{i^*}$ with $i^* \in \operatorname{argmax}_{i=1,\ldots,|L|} L_i$

Where:
    $\pi(p) = \frac{1}{\tau} \sum_{i=1}^{\tau} \min_{e \in E} \| A p_i - e \|$ //penalty function
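The following is an end-to-end Python sketch of Algorithm 2, using a scikit-learn GP surrogate and simple random-candidate maximization of the penalized UCB (the paper uses a Sobol design and an evolutionary acquisition optimizer instead); `ask_generator` and `ask_solver`, the API calls to $M_g$ and $M_s$, as well as the prompt handling, are illustrative assumptions, and $E$ is taken here as a NumPy array.

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.gaussian_process import GaussianProcessRegressor

def boing(ask_generator, ask_solver, examples, val_set, E, decode,
          d=10, tau=5, T=30, n_init=10, n_cand=25, beta=1.0, C=1.0):
    q = E.shape[1]
    A = np.random.uniform(-1.0, 1.0, size=(q, d))      # random projection

    def to_instruction(p):
        proj = p @ A.T                                  # tau x q projections
        idx = cdist(proj, E).argmin(axis=1)             # nearest embeddings
        w = "".join(decode(int(j)) for j in idx)        # hard prompt
        return ask_generator(w, examples)               # instruction v

    def pi(p):                                          # penalty pi(p)
        return cdist(p @ A.T, E).min(axis=1).mean()

    def score_fn(v):                                    # 0/1 loss over (X, Y)
        return float(np.mean([ask_solver(v, x) == y for x, y in val_set]))

    # Initialization (the paper uses a Sobol design here).
    P, V, L = [], [], []
    for _ in range(n_init):
        p = np.random.uniform(-1.0, 1.0, size=(tau, d))
        v = to_instruction(p)
        P.append(p); V.append(v); L.append(score_fn(v))

    # BO loop with penalized UCB.
    while len(L) <= T:
        gp = GaussianProcessRegressor(normalize_y=True)
        gp.fit(np.stack([p.ravel() for p in P]), np.array(L))
        candidates = [np.random.uniform(-1.0, 1.0, size=(tau, d))
                      for _ in range(n_cand)]
        def ucb(p):
            mu, sd = gp.predict(p.ravel()[None, :], return_std=True)
            return mu[0] + beta * sd[0] - C * pi(p)
        p = max(candidates, key=ucb)
        v = to_instruction(p)
        P.append(p); V.append(v); L.append(score_fn(v))

    return V[int(np.argmax(L))]                         # best instruction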

4. Computational Results

4.1. Evaluation Metrics

Different score functions were considered depending on the specific task. In detail, we used the following four evaluation metrics:
Exact Match (EM): a correct response yields $EM = 1$; otherwise, $EM = 0$. The associated tasks are "Letters list" and "Second word letter".
Exact Set (ES): when evaluating each question-and-answer pair, $ES = 1$ if the predicted response precisely matches the set of correct responses and $ES = 0$ otherwise. The associated task is "Taxonomy".
Contain: this measure has a value of $Contain = 1$ if the predicted answer is contained in the set of correct responses and $Contain = 0$ otherwise. The associated tasks are "Word Sorting" and "Synonyms".
F1: the usual F1 score, calculated by comparing individual words in the predicted response to those in the actual (true) answer. The words common to the predicted and actual answers form the basis for the F1 score: precision is the proportion of common words to the total words in the predicted response, while recall is the proportion of common words to the total words in the actual answer. The associated task is "Common". (A sketch of these metrics follows the list.)
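A minimal Python sketch of the four metrics, assuming simple whitespace tokenization (the exact normalization used in the benchmark may differ):

def exact_match(pred: str, gold: str) -> int:
    """EM: 1 if the prediction equals the reference exactly."""
    return int(pred.strip() == gold.strip())

def exact_set(pred: str, golds: set) -> int:
    """ES: 1 if the predicted items match the reference set exactly."""
    return int(set(pred.split()) == golds)

def contain(pred: str, golds: set) -> int:
    """Contain: 1 if the predicted answer is among the correct responses."""
    return int(pred.strip() in golds)

def f1(pred: str, gold: str) -> float:
    """Word-level F1 between prediction and reference."""
    p_words, g_words = pred.split(), gold.split()
    common = set(p_words) & set(g_words)
    if not common:
        return 0.0
    precision = len(common) / len(p_words)
    recall = len(common) / len(g_words)
    return 2 * precision * recall / (precision + recall)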

4.2. Selected Tasks

The tasks were selected from [14]; due to budgetary constraints, we selected a subset of 6 of the original 24 tasks to reduce the required testing and implementation time. The selected tasks were chosen to be representative of different levels of difficulty (based on the scores reported for APE, Uniform, and InstructZero).
Easy tasks:
Letter list: Provide the input string, separating each character with a space.
Word sorting: With a list as input, the task is to produce an output that is the list in alphabetical order.
Challenging tasks:
Taxonomy: Create a program that generates a list of animals based on the provided input.
Synonyms: Provide the synonyms of the input word.
Second word letter: Taking a string as input, return the second character.
Common: Extract the relationship between the input words.
Table 1 summarizes the results obtained by BOInG on the selected tasks and compares them against three other methods. BOInG’s test score is in line with other BO-based approaches (i.e., Instinct and InstructZero) and outperforms them on two tasks.
Table 1. Test scores comparing BOInG against APE, Instinct, and InstructZero (* for APE and Instinct results from [15]; ** for InstructZero results from [14]).
An important consideration is the significantly different costs entailed by the three methods. BOInG requires the invocation of $M_g$ via an API, implying a minimal additional cost of approximately USD 0.01 per 20,270 tokens (using GPT-3.5-Turbo-0125, OpenAI, San Francisco, CA, USA). This cost is significantly lower than that required to run white-box LLMs. For instance, the GPU time required for InstructZero (using an NVIDIA A6000, NVIDIA, Santa Clara, CA, USA) would cost approximately USD 0.01 for just 25 s, during which only a few local LLM calls could be executed.
Finally, we provide some examples of generated instructions and associated hard prompts over the BO process. In Table 2, the "hard prompt" column displays the hard prompt prepended to the input given to the instruction generator LLM ($M_g$), consisting of seemingly random character strings that represent the algorithm's exploration of the prompt space. The "instruction" column shows the human-readable, task-relevant instructions generated by $M_g$ based on these hard prompts and task examples. The "score (train)" column indicates the performance of each instruction, ranging from 0 to 1, with higher scores denoting better performance. The table captures the optimization process across six rows, including initialization and subsequent iterations. Initially, two different hard prompts yield similar instructions with identical scores of 0.40. The third iteration shows a marginal improvement, achieving a score of 0.45. The fourth row demonstrates the best performance, reaching a score of 0.55 with the instruction "Provide synonymous terms for the given words". This table showcases the algorithm's progression towards increasingly effective instructions for the synonyms task.
Table 2. The presented table illustrates the iterative process of instruction optimization for a synonyms task (no further improvement after the 3rd BO iteration).

5. Conclusions, Limitations, and Perspectives

BOInG demonstrates the versatility and efficiency of Bayesian Optimization (BO) for black-box problems in tackling the highly structured combinatorial challenges of prompt tuning and instruction generation in LLMs.
By leveraging an "as-a-service" API paradigm and treating both models as black boxes, BOInG avoids the need for access to the instruction generator's architecture and weights, offering greater flexibility. This approach enables BOInG to utilize high-performing, closed-source models like GPT-4o or Claude 3.5 Sonnet without requiring significant local computational resources. Unlike methods relying on open-source LLMs that necessitate dedicated hardware such as GPUs, BOInG only requires penalty computations using the GPT-2 embedding layer, which is orders of magnitude smaller than the LLMs used in other approaches. Experimental results confirm BOInG's competitive performance compared to state-of-the-art methods like InstructZero and Instinct, with substantially lower resource requirements.
While effective, BOInG relies on embeddings from GPT-2 for penalty function calculations, which may introduce coherence challenges due to the mismatch between its embedding layer and that of the primary LLM. Addressing this limitation through experimentation with alternative embeddings or model-specific penalty functions could improve compatibility and performance.
Despite the successes of BO, its limitations include challenges in handling categorical variables and problems with many variables or with images, settings in which Gaussian Processes (GPs) tend to underperform. For such cases, Bayesian neural networks have been suggested as promising alternatives, offering the ability to flexibly represent non-stationary behavior and handle multi-output objectives. Replacing GPs with Bayesian neural networks could enhance the robustness of BOInG in these scenarios.
An entirely different perspective is to generalize the capabilities of LLMs beyond natural language tasks, generating candidate solutions for the BO process with limited data in contexts requiring generalization from few examples.

Author Contributions

Conceptualization, F.A., I.G. and A.C.; Methodology, F.A., A.P. and A.C.; Software, A.S. and A.P.; Validation, A.P.; Investigation, I.G.; Resources, I.G.; Data curation, A.S.; Writing—original draft, F.A. and A.C.; Writing—review & editing, A.C.; Visualization, A.S.; Supervision, I.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in public repositories from previous papers reporting the associated URLs: InstructZero (https://github.com/Lichang-Chen/InstructZero, accessed on 6 December 2024); Instinct (https://github.com/xqlin98/INSTINCT, accessed on 6 December 2024).

Conflicts of Interest

Authors Andrea Ponti and Ilaria Giordani were employed by the company OAKS s.r.l. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Song, X.; Tian, Y.; Lange, R.T.; Lee, C.; Tang, Y.; Chen, Y. Position Paper: Leveraging Foundational Models for Black-Box Optimization: Benefits, challenges, and Future Directions. arXiv 2024, arXiv:2405.03547. [Google Scholar]
  2. Liu, S.; Chen, C.; Qu, X.; Tang, K.; Ong, Y.S. Large language models as evolutionary optimizers. arXiv 2023, arXiv:2310.19046. [Google Scholar]
  3. Song, X.; Li, O.; Lee, C.; Peng, D.; Perel, S.; Chen, Y. Omnipred: Language models as universal regressors. arXiv 2024, arXiv:2402.14547. [Google Scholar]
  4. Lange, R.T.; Tian, Y.; Tang, Y. Evolution Transformer: In-Context Evolutionary Optimization. arXiv 2024, arXiv:2403.02985. [Google Scholar]
  5. Liu, F.; Lin, X.; Wang, Z.; Yao, S.; Tong, X.; Yuan, M.; Zhang, Q. Large language model for multi-objective evolutionary optimization. arXiv 2023, arXiv:2310.12541. [Google Scholar]
  6. Liu, F.; Tong, X.; Yuan, M.; Zhang, Q. Algorithm evolution using large language model. arXiv 2023, arXiv:2311.15249. [Google Scholar]
  7. Li, J.; Song, F.; Jin, Y.; Qiang, W.; Zheng, C.; Sun, F.; Xiong, H. BayesPrompt: Prompting Large-Scale Pre-Trained Language Models on Few-shot Inference via Debiased Domain Abstraction. arXiv 2024, arXiv:2401.14166. [Google Scholar]
  8. Müller, S.; Hollmann, N.; Arango, S.P.; Grabocka, J.; Hutter, F. Transformers can do bayesian inference. arXiv 2021, arXiv:2112.10510. [Google Scholar]
  9. Liu, T.; Astorga, N.; Seedat, N.; van der Schaar, M. Large language models to enhance bayesian optimization. arXiv 2024, arXiv:2402.03921. [Google Scholar]
  10. Archetti, F.; Candelieri, A. Bayesian Optimization and Data Science; Springer International Publishing: Berlin/Heidelberg, Germany, 2019; Volume 849. [Google Scholar]
  11. Garnett, R. Bayesian Optimization; Cambridge University Press: Cambridge, UK, 2023. [Google Scholar]
  12. Williams, C.K.; Rasmussen, C.E. Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, USA, 2006; Volume 2, p. 4. [Google Scholar]
  13. Gramacy, R.B. Surrogates: Gaussian Process Modeling, Design, and Optimization for the Applied Sciences; Chapman and Hall/CRC: Boca Raton, FL, USA, 2020. [Google Scholar]
  14. Chen, L.; Chen, J.; Goldstein, T.; Huang, H.; Zhou, T. InstructZero: Efficient instruction optimization for black-box large language models. arXiv 2023, arXiv:2306.03082. [Google Scholar]
  15. Lin, X.; Wu, Z.; Dai, Z.; Hu, W.; Shu, Y.; Ng, S.K.; Jaillet, P.; Low, B.K.H. Use your instinct: Instruction optimization using neural bandits coupled with transformers. arXiv 2023, arXiv:2310.02905. [Google Scholar]
  16. Sabbatella, A.; Ponti, A.; Giordani, I.; Candelieri, A.; Archetti, F. Prompt Optimization in Large Language Models. Mathematics 2024, 12, 929. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
