Our approach to solving the task involves assigning the function of storing information from dialogue into the database to a designated language model, which we term the storage language model, and likewise assigning the function of retrieving relevant information from the database to another language model, which we term the retrieval language model. The storage language model will be denoted $\pi_{\theta_s}(a_t \mid s_t)$, which indicates that it is a function parameterized by $\theta_s$ for generating the probability of selecting $a_t$ as the next token following the input token sequence $s_t$ at time $t$. Once the use of the storage language model is selected, tokens are generated according to $a_t \sim \pi_{\theta_s}(\cdot \mid s_t)$, adding each new token to $s_t$ at each timestep such that $s_{t+1} = s_t \oplus a_t$ (where $\oplus$ indicates concatenation) until the <EOS> token is generated (with probability $\pi_{\theta_s}(\texttt{<EOS>} \mid s_t)$), which terminates the use of the model. Likewise, use of the retrieval language model $\pi_{\theta_r}(a_t \mid s_t)$ results in tokens being generated according to $a_t \sim \pi_{\theta_r}(\cdot \mid s_t)$ until the <EOS> token is generated with probability $\pi_{\theta_r}(\texttt{<EOS>} \mid s_t)$. Hence, these two language models represent two different options that, upon initiation, generate sequences of actions over an indefinite number of timesteps.
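As a concrete illustration of this option-style generation, the following sketch runs a language model as an option that samples actions until it emits <EOS>. The `sample_next_token` interface and the `max_steps` cap are assumptions for the sketch (the text itself places no fixed limit on the number of timesteps).

```python
def run_option(lm, prompt_tokens, eos_id, max_steps=512):
    """Run a language-model 'option': sample actions a_t autoregressively,
    appending each one to the state s_t, until <EOS> terminates the option."""
    s = list(prompt_tokens)
    for _ in range(max_steps):
        a = lm.sample_next_token(s)   # a_t ~ pi_theta(. | s_t)  (assumed interface)
        s.append(a)                   # s_{t+1} = s_t concatenated with a_t
        if a == eos_id:
            break
    return s[len(prompt_tokens):]     # the option's generated action sequence
```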
In the case of $\pi_{\theta_s}$, the inputs $s_t$ that we use consist of each pair of utterances in the dialogue history, entered sequentially with a stride of 1, along with the current contents of the SQL table being used to store information. In Table 2, we show an example of how the lines in the dialogue history are split into pairs to be fed as input into the storage language model. The outputs that we expect from $\pi_{\theta_s}$ are SQL commands for storing relevant information from the given dialogue into a SQL table. These commands are executed by a Python script to update the SQL table.
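A minimal sketch of these two mechanics, pairing the dialogue history with a stride of 1 and executing the generated SQL against the table, is shown below. The in-memory sqlite3 database and the `memory` table schema are illustrative assumptions, not the schema used in the paper.

```python
import sqlite3

def utterance_pairs(dialogue):
    """Split the dialogue history into overlapping pairs with a stride of 1."""
    return [(dialogue[i], dialogue[i + 1]) for i in range(len(dialogue) - 1)]

def apply_storage_commands(conn, sql_commands):
    """Execute the SQL commands emitted by the storage model to update the table."""
    cur = conn.cursor()
    for cmd in sql_commands:
        cur.execute(cmd)
    conn.commit()

# Illustrative table only; the actual schema is specified by the storage prompt.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memory (speaker TEXT, field TEXT, value TEXT)")
```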
2.2.2. Storage Subtask
Although we have divided the responsibilities of the two members of our team, their tasks are still part of the main task and serve a common objective. We now introduce subtasks whose objectives diverge from those of the main task, the pursuit of which we hope will help to achieve the objectives of the main task more effectively than pursuing the main task directly.
In the case of $\pi_{\theta_s}$, one observation we made when using the baseline described above is that it would often generate SQL commands that confuse Speaker 1 with Speaker 2. At times, it would store facts about Speaker 1 under the records for Speaker 2 and vice versa. An example is shown in Table 3, where we can see that the storage function has either updated the records for the wrong speaker or mistakenly updated the records for both speakers for every field (this example is not meant to show the typical case, but merely to show that the issue exists). Thus, we postulate that training the storage function to distinguish between Speaker 1 and Speaker 2 could be more effective than optimizing against the main reward itself.
To train the storage function on this subtask, we make use of the “Personas” in the MSC-Self-Instruct dataset. As shown in Table 4, we take the personas in each multi-session chat and form a pseudo-dialog, using each line in a speaker's persona as that speaker's line in the pseudo-dialog and alternating between Speaker 1 and Speaker 2. Letting $P^{(1)} = (p^{(1)}_1, p^{(1)}_2, \ldots)$ represent the persona for Speaker 1 and $P^{(2)} = (p^{(2)}_1, p^{(2)}_2, \ldots)$ represent the persona for Speaker 2, we alternate between each line $p^{(1)}_i$ and each line $p^{(2)}_i$ to create the pseudo-dialog $(u_1, u_2, u_3, u_4, \ldots) = (p^{(1)}_1, p^{(2)}_1, p^{(1)}_2, p^{(2)}_2, \ldots)$. We then split the pseudo-dialog into pairs $(u_k, u_{k+1})$, as explained above, to be used as inputs during training. The advantage of using the personas instead of the actual dialogue sessions from the dataset is that each line in the personas is self-contained, in that it does not require knowledge of the preceding line to determine its precise meaning. For instance, in actual dialogue, a line may simply consist of the word “yes”, which requires the preceding line to interpret correctly, but with personas, no such context is necessary.
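Under these definitions, building the pseudo-dialog amounts to interleaving the two persona lists and reusing the stride-1 pairing. The sketch below assumes the two personas are truncated to equal length and reuses the `utterance_pairs` helper from the earlier sketch; the speaker-prefixed string format is an assumption.

```python
def build_pseudo_dialog(persona_1, persona_2):
    """Interleave persona lines into alternating utterances:
    Speaker 1, Speaker 2, Speaker 1, Speaker 2, ...
    zip() truncates to the shorter persona (an assumption of this sketch)."""
    dialog = []
    for line_1, line_2 in zip(persona_1, persona_2):
        dialog.append(f"Speaker 1: {line_1}")
        dialog.append(f"Speaker 2: {line_2}")
    return dialog

pairs = utterance_pairs(build_pseudo_dialog(persona_1_lines, persona_2_lines))
```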
Using the fact that each persona line is self-contained, we are able to generate the ground truth label for the storage function by inputting only the second of the two lines in each pair and instructing the language model (we use a frozen copy $\bar{\pi}_{\theta_s}$ of the storage language model $\pi_{\theta_s}$) to generate SQL commands for updating the SQL table with information about the corresponding speaker. Relabeling the indices of the utterance pairs as $(u_{2k-1}, u_{2k})$ for pairs ending with Speaker 2 and $(u_{2k}, u_{2k+1})$ for pairs ending with Speaker 1, we delete the first utterance in each pair to get $u_{2k}$ and $u_{2k+1}$, respectively, and generate the SQL commands $y_{2k}$, sampled according to $\bar{\pi}_{\theta_s}(\cdot \mid u_{2k})$, which become the labels for the pairs ending with Speaker 2, and $y_{2k+1}$, sampled according to $\bar{\pi}_{\theta_s}(\cdot \mid u_{2k+1})$, which become the labels for the pairs ending with Speaker 1. The generated SQL commands should only contain information about the last speaker in the pair of utterances, and they should only update the records for that speaker, since only their line was passed to the language model. The full utterance pairs $(u_{2k-1}, u_{2k})$ and $(u_{2k}, u_{2k+1})$ are then passed as input to the storage language model being trained, and it is trained to maximize the likelihood of generating the corresponding ground truth labels $y_{2k}$ and $y_{2k+1}$, respectively. In practice, we aim to minimize the negative log likelihood $\mathcal{L}(\theta_s) = \mathcal{L}_{2}(\theta_s) + \mathcal{L}_{1}(\theta_s)$, where

$\mathcal{L}_{2}(\theta_s) = -\log \pi_{\theta_s}(y_{2k} \mid u_{2k-1}, u_{2k})$ (3)

and

$\mathcal{L}_{1}(\theta_s) = -\log \pi_{\theta_s}(y_{2k+1} \mid u_{2k}, u_{2k+1})$ (4)

which we minimize using standard gradient descent methods, thus encouraging the language model to only output SQL commands corresponding to the latter speaker when presented with a pair of utterances from the dialogue history (and hence achieving the subtask goal of distinguishing between Speaker 1 and Speaker 2). An illustration of this process is shown in Figure 3. This suffices to train $\pi_{\theta_s}$ to achieve its subtask, but to do so in a way that does not diverge from achieving the main task requires us to convert the subtask into a reward-respecting subtask. We do this by making the gradient update contingent on receiving a reward.
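The label-generation and subtask-loss steps can be sketched as follows. The `generate` and `log_prob` methods are assumed interfaces to the frozen and trainable storage models (with `log_prob` returning a differentiable scalar), not an API from the paper.

```python
def make_subtask_labels(frozen_storage_lm, pairs):
    """Ground-truth SQL labels are generated from only the second utterance of each pair,
    so each label can only describe, and update records for, the last speaker."""
    examples = []
    for first, second in pairs:
        label_sql = frozen_storage_lm.generate(prompt=second)  # assumed generate() interface
        examples.append({"pair": (first, second), "label": label_sql})
    return examples

def storage_nll(storage_lm, example):
    """Negative log likelihood of the label given the full utterance pair (Equations (3) and (4))."""
    first, second = example["pair"]
    return -storage_lm.log_prob(target=example["label"], prompt=first + "\n" + second)
```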
As we generate the labels $y_{2k}$ and $y_{2k+1}$, which consist of SQL commands, we execute them using a Python script to store facts about the speakers into a SQL table. Once all of the labels for a given multi-session chat have been generated and executed, we pass the final question $q$ into the retrieval model to generate a series of SQL commands $r$, sampled according to $\pi_{\theta_r}(\cdot \mid q)$. We then execute the SQL commands in $r$ to retrieve information $i$ from the SQL table and enter the information, along with the question $q$, into the generation LLM to generate the final predicted answer $a$ according to $\pi_{\text{gen}}(a \mid q, i)$, where $\pi_{\text{gen}}$ adds its own instructions to the prompt for generating answers. Finally, the predicted answer $a$ is passed along with the question $q$ and the ground truth answer $g$ to the LLM judge, $\pi_{\text{judge}}$, which adds instructions for generating either the token “CORRECT” or “WRONG”, and we obtain the probability of “CORRECT” as $c = \pi_{\text{judge}}(\text{“CORRECT”} \mid q, a, g)$. If $c$ is greater than some threshold $\tau$, then we give a reward $R$ of 1, and the reward is 0 otherwise.
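A sketch of this reward computation is given below. The model wrappers (`generate`, `prob_of`) and the `answer_prompt`/`judge_prompt` formatting helpers are hypothetical stand-ins for the retrieval model, generation LLM, and judge LLM, and the threshold passed as `tau` is whatever value the experimenter chooses.

```python
def compute_reward(conn, retrieval_lm, gen_llm, judge_llm, question, gold_answer, tau):
    """Terminal reward: retrieve from the SQL table, answer the question, and ask the judge."""
    retrieval_sql = retrieval_lm.generate(prompt=question)              # r ~ pi_theta_r(. | q)
    info = [conn.execute(cmd).fetchall() for cmd in retrieval_sql]      # information i
    answer = gen_llm.generate(prompt=answer_prompt(question, info))     # a ~ pi_gen(. | q, i)
    c = judge_llm.prob_of("CORRECT", prompt=judge_prompt(question, answer, gold_answer))
    return 1 if c > tau else 0
```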
Now, we multiply the loss function by the reward such that the weight update based on gradient descent is as follows:

$\theta_{t+1} = \theta_t - \alpha R \, \nabla_{\theta_t} \mathcal{L}(\theta_t)$ (5)

where $\alpha$ is the step-size parameter, $\nabla_{\theta_t} \mathcal{L}(\theta_t)$ is the gradient of the loss function as specified in Equations (3) and (4), $\theta_{t+1}$ represents the weights of the storage model at time $t+1$, and $\theta_t$ are the weights at time $t$. This turns our update into an instance of the REINFORCE algorithm [38], which ensures that the likelihood is only increased for labels that have ultimately resulted in a correct answer being generated. To address reward sparsity, when the reward is 0, we repeat this process for the same multi-session chat until the reward is 1, up to a specified maximum number of attempts.
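Putting the pieces together, the reward-respecting update can be sketched with a PyTorch-style optimizer. Here `relabel_chat` (regenerate labels with the frozen model and rebuild the table for one multi-session chat), the `chat["question"]`/`chat["answer"]` fields, and `max_attempts=5` are hypothetical names and values for this sketch; `storage_nll` and `compute_reward` refer to the earlier sketches, and `log_prob` is assumed to return a differentiable scalar.

```python
def reward_weighted_step(storage_lm, optimizer, examples, R):
    """One REINFORCE-style step: theta <- theta - alpha * R * grad(L), with L from Eqs. (3)/(4)."""
    loss = sum(storage_nll(storage_lm, ex) for ex in examples)  # differentiable scalar (assumed)
    optimizer.zero_grad()
    (R * loss).backward()
    optimizer.step()

def train_on_chat(storage_lm, frozen_lm, retrieval_lm, gen_llm, judge_llm,
                  optimizer, chat, tau, max_attempts=5):
    """Retry the same multi-session chat until the reward is 1, up to max_attempts."""
    for _ in range(max_attempts):
        examples, conn = relabel_chat(frozen_lm, chat)  # hypothetical: regenerate labels, rebuild table
        R = compute_reward(conn, retrieval_lm, gen_llm, judge_llm,
                           chat["question"], chat["answer"], tau)
        if R == 1:
            reward_weighted_step(storage_lm, optimizer, examples, R)
            return True
    return False
```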
2.2.3. Retrieval Subtask
In order to train $\pi_{\theta_r}$, we follow the same procedure as above, using the SQL commands $r$ generated by $\bar{\pi}_{\theta_r}$, which is a frozen copy of $\pi_{\theta_r}$, as the ground truth label and maximizing the likelihood of $r$ when it results in a correct answer being generated. The loss function is thus defined as:

$\mathcal{L}(\theta_r) = -R \log \pi_{\theta_r}(r \mid q)$ (6)

Again, we repeat the generation of $r$ when the reward is 0, up to a maximum number of attempts. This suffices to make the task reward-respecting, but since it is directly optimizing against the reward, it is not yet a subtask. One weakness that we observed with the baseline retrieval model was that, in many cases where the predicted answer was incorrect, the required information was present in the SQL table (but not retrieved) and could have been retrieved if more fields had been queried. Even when instructions to retrieve a specified number of fields are included in the prompt for $\pi_{\theta_r}$, these instructions are not reliably followed. Thus, we define our retrieval subtask as retrieving a number of fields within a desired range $[n_{\min}, n_{\max}]$. If the number of fields retrieved $n$ falls within the range, a small bonus is added to the reward $R$ to encourage achieving the subtask of retrieving a sufficient number of fields.
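This field-count bonus can be expressed as a small reward-shaping function; the range bounds and bonus size below are illustrative values, not those used in the paper.

```python
def shaped_reward(R, num_fields_retrieved, n_min=3, n_max=8, bonus=0.2):
    """Add a small bonus to R when the number of retrieved fields falls in [n_min, n_max]."""
    if n_min <= num_fields_retrieved <= n_max:
        return R + bonus
    return R
```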
Table 5 shows an example of SQL commands generated by $\pi_{\theta_r}$ before training that omit the necessary information due to not querying a sufficient number of fields.