Our approach to solving the task assigns the function of storing information from the dialogue in the database to a designated language model, which we term the storage language model, and the function of retrieving relevant information from the database to another language model, which we term the retrieval language model. The storage language model is denoted $\pi_{\theta}(a_t \mid s_t)$, which indicates that it is a function parameterized by $\theta$ that gives the probability of selecting $a_t$ as the next token following the input token sequence $s_t$ at time $t$. Once the storage language model is selected, tokens are generated according to $\pi_{\theta}(a_t \mid s_t)$, and each new token $a_t$ is appended to $s_t$ at each timestep such that $s_{t+1} = s_t \oplus a_t$ (where $\oplus$ indicates concatenation) until the <EOS> token is generated (with probability $\pi_{\theta}(\text{<EOS>} \mid s_t)$), which terminates the use of the model. Likewise, use of the retrieval language model results in tokens being generated according to $\pi_{\phi}(a_t \mid s_t)$ until the <EOS> token is generated with probability $\pi_{\phi}(\text{<EOS>} \mid s_t)$. Hence, these two language models represent two different options that, upon initiation, generate sequences of actions over an indefinite number of timesteps.
In the case of $\pi_{\theta}$, the inputs $s_t$ consist of each pair of consecutive utterances in the dialogue history, entered sequentially with a stride of 1, along with the current contents of the SQL table used to store information. Table 2 shows an example of how the lines in the dialogue history are split into pairs to be fed as input into the storage language model. The outputs that we expect from $\pi_{\theta}$ are SQL commands for storing relevant information from the given dialogue in a SQL table. These commands are executed by a Python script to update the SQL table.
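The input construction and command execution described above can be illustrated with a minimal sketch. The table schema (`facts`), the helper names, and the example SQL string are hypothetical stand-ins chosen for illustration; they are not the schema or script used in our experiments.

```python
import sqlite3

def utterance_pairs(dialogue):
    """Return consecutive utterance pairs using a sliding window with stride 1."""
    return [(dialogue[i], dialogue[i + 1]) for i in range(len(dialogue) - 1)]

def execute_sql(conn, commands):
    """Execute one or more semicolon-separated SQL commands and commit."""
    conn.executescript(commands)
    conn.commit()

dialogue = [
    "Speaker 1: I just adopted a dog.",
    "Speaker 2: Nice! I have two cats.",
    "Speaker 1: What are their names?",
]
pairs = utterance_pairs(dialogue)  # each utterance after the first ends one pair

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per stored fact.
execute_sql(conn, "CREATE TABLE facts (speaker TEXT, field TEXT, value TEXT);")
# Hypothetical storage-model output for the first pair.
execute_sql(conn, "INSERT INTO facts VALUES ('Speaker 2', 'pets', 'two cats');")
rows = conn.execute("SELECT speaker, field, value FROM facts").fetchall()
```

With a stride of 1, every utterance except the first appears as the second element of exactly one pair, so each line of dialogue is the "latest" line once.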
  2.2.2. Storage Subtask
Although we have divided the responsibilities between the two members of our team, their tasks are still part of the main task and serve a common objective. We now introduce subtasks whose objectives diverge from those of the main task, the pursuit of which we hope will achieve the objectives of the main task more effectively than pursuing the main task directly.
In the case of $\pi_{\theta}$, one observation we made when using the baseline described above is that it would often generate SQL commands that confuse Speaker 1 with Speaker 2. At times, it would store facts about Speaker 1 under the records for Speaker 2 and vice versa. An example is shown in Table 3, where the storage function has either updated the records for the wrong speaker or mistakenly updated the records for both speakers for every field (this example is not meant to show the typical case, but merely to show that the issue exists). Thus, we postulate that training the storage function to distinguish between Speaker 1 and Speaker 2 could be more effective than optimizing against the main reward itself.
To train the storage function on this subtask, we make use of the “Personas” in the MSC-Self-Instruct dataset. As shown in Table 4, we take the personas in each multi-session chat and form a pseudo-dialog by using each line in a speaker's persona as that speaker's line in the pseudo-dialog and alternating between Speaker 1 and Speaker 2. Letting $P^1 = (p^1_1, \dots, p^1_n)$ represent the persona for Speaker 1 and $P^2 = (p^2_1, \dots, p^2_n)$ represent the persona for Speaker 2, we alternate between each line $p^1_k$ and each line $p^2_k$ to create the pseudo-dialog $D = (p^1_1, p^2_1, p^1_2, p^2_2, \dots, p^1_n, p^2_n)$. We then split the pseudo-dialog into pairs $(p^1_k, p^2_k)$ and $(p^2_k, p^1_{k+1})$, as explained above, to be used as inputs during training. The advantage of using the personas instead of the actual dialogue sessions from the dataset is that each line in the personas is self-contained: it does not require knowledge of the preceding line to determine its precise meaning. For instance, in actual dialogue, a line may simply consist of the word “yes”, which requires the preceding line to be interpreted correctly, whereas with personas no such context is necessary.
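As a minimal sketch of the pseudo-dialog construction described above, the following interleaves two persona lists and splits the result into stride-1 pairs. The persona lines and function names are invented for illustration, and the sketch assumes both personas have the same number of lines (`zip` would silently drop any surplus lines).

```python
from itertools import chain

def build_pseudo_dialog(persona_1, persona_2):
    """Alternate persona lines: Speaker 1, Speaker 2, Speaker 1, Speaker 2, ..."""
    # Assumption: equal-length personas; zip truncates to the shorter list.
    return list(chain.from_iterable(zip(persona_1, persona_2)))

def utterance_pairs(dialogue):
    """Return consecutive utterance pairs using a sliding window with stride 1."""
    return [(dialogue[i], dialogue[i + 1]) for i in range(len(dialogue) - 1)]

# Hypothetical persona lines, tagged with the speaker they describe.
persona_1 = ["Speaker 1: I work as a nurse.", "Speaker 1: I have a dog."]
persona_2 = ["Speaker 2: I live in Texas.", "Speaker 2: I like jazz."]

pseudo_dialog = build_pseudo_dialog(persona_1, persona_2)
pairs = utterance_pairs(pseudo_dialog)
```

The resulting pairs alternate between ending with a Speaker 2 line and ending with a Speaker 1 line, which is what allows separate labels for each speaker.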
Using this fact, we are able to generate the ground truth label for the storage function by inputting only the second of the two lines in each pair and instructing the language model (we use a frozen copy $\pi_{\bar{\theta}}$ of the storage language model $\pi_{\theta}$) to generate SQL commands for updating the SQL table with information about the corresponding speaker. Relabeling the indices of the utterance pairs as $(u^1_i, u^2_i)$ for pairs ending with Speaker 2 and $(u^2_j, u^1_j)$ for pairs ending with Speaker 1, we delete the first utterance in each pair to get $u^2_i$ and $u^1_j$, and generate the SQL commands $y^2_i$ sampled according to $\pi_{\bar{\theta}}(\cdot \mid u^2_i)$, which become the labels for the utterance pairs ending with Speaker 2, and $y^1_j$ sampled according to $\pi_{\bar{\theta}}(\cdot \mid u^1_j)$, which become the labels for the utterance pairs ending with Speaker 1. The generated SQL commands should contain information only about the last speaker in the pair of utterances and should update only the records for that speaker, since only their line was passed to the language model. The full utterance pairs $(u^1_i, u^2_i)$ and $(u^2_j, u^1_j)$ are then passed as input to the storage language model being trained, which is trained to maximize the likelihood of generating the corresponding ground truth labels $y^2_i$ and $y^1_j$, respectively. In practice, we aim to minimize the negative log likelihood $L(\theta) = L_2(\theta) + L_1(\theta)$, where
$$L_2(\theta) = -\sum_{i} \log \pi_{\theta}\left(y^2_i \mid u^1_i, u^2_i\right) \tag{3}$$
and
$$L_1(\theta) = -\sum_{j} \log \pi_{\theta}\left(y^1_j \mid u^2_j, u^1_j\right), \tag{4}$$
which we minimize using standard gradient descent methods, thus encouraging the language model to output only SQL commands corresponding to the latter speaker when presented with a pair of utterances from the dialogue history (and hence achieving the subtask goal of distinguishing between Speaker 1 and Speaker 2). An illustration of this process is shown in Figure 3. This suffices to train $\pi_{\theta}$ to achieve its subtask, but doing so in a way that does not diverge from achieving the main task requires us to convert the subtask into a reward-respecting subtask. We do this by making the gradient update contingent on receiving a reward.
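The label-generation and likelihood steps described above can be sketched as follows. This is an illustrative outline only: `toy_frozen_model` is a hypothetical stand-in for the frozen copy of the storage language model, the `facts` table is an invented schema, and the token probabilities are made-up numbers rather than model outputs.

```python
import math

def make_training_example(pair, frozen_label_fn):
    """Build (input, label): the label is generated from the second utterance only."""
    first, second = pair
    label = frozen_label_fn(second)  # the frozen copy never sees the first line
    return (first, second), label    # the trained model receives the full pair

def negative_log_likelihood(token_probs):
    """NLL of a label, given the probability the trained model assigns to each token."""
    return -sum(math.log(p) for p in token_probs)

def toy_frozen_model(utterance):
    """Hypothetical stand-in: emit one INSERT for the speaker of the given line."""
    speaker, text = utterance.split(": ", 1)
    return f"INSERT INTO facts VALUES ('{speaker}', 'note', '{text}');"

pair = ("Speaker 1: I work as a nurse.", "Speaker 2: I live in Texas.")
model_input, label = make_training_example(pair, toy_frozen_model)

# Made-up per-token probabilities for a three-token label.
loss = negative_log_likelihood([0.9, 0.8, 0.5])
```

Because the label is produced from the second line alone, it can only mention the latter speaker, while the model being trained must learn to produce that same label from the full pair.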
As we generate the labels $y^2_i$ and $y^1_j$, which consist of SQL commands, we execute them using a Python script to store facts about the speakers in a SQL table. Once all of the labels for a given multi-session chat have been generated and executed, we pass the final question $q$ into the retrieval model $\pi_{\phi}$ to generate a series of SQL commands $r$, which is sampled according to $\pi_{\phi}(\cdot \mid q)$. We then execute the SQL commands in $r$ to retrieve information $i$ from the SQL table and enter the information, along with the question $q$, into the generation LLM $\pi_{\text{gen}}$ to generate the final predicted answer $\hat{g}$ according to $\pi_{\text{gen}}(\cdot \mid x_{\text{gen}}, q, i)$, where $\pi_{\text{gen}}$ adds its own instructions $x_{\text{gen}}$ to the prompt for generating answers. Finally, the predicted answer $\hat{g}$ is passed along with the question $q$ and the ground truth answer $g$ to the LLM judge, $\pi_{\text{judge}}$, which adds instructions $x_{\text{judge}}$ for generating either the token “CORRECT” or “WRONG”, and we obtain the probability of “CORRECT” as $c = \pi_{\text{judge}}(\text{CORRECT} \mid x_{\text{judge}}, q, \hat{g}, g)$. If $c$ is greater than some threshold $\tau$, we give a reward $R$ of 1; otherwise, the reward is 0.
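The reward computation described above can be outlined with a short sketch. The three callables are hypothetical stand-ins for the retrieval model, the generation LLM, and the LLM judge; the threshold value of 0.5 and the `facts` schema are assumptions made purely for illustration.

```python
import sqlite3

def binary_reward(p_correct, threshold=0.5):
    """Return 1 if the judge's probability of CORRECT exceeds the threshold, else 0."""
    return 1 if p_correct > threshold else 0

def answer_and_score(conn, question, truth, retrieve_fn, generate_fn, judge_fn,
                     threshold=0.5):
    """Retrieve, generate an answer, and convert the judge's probability to a reward."""
    sql = retrieve_fn(question)                 # SQL commands r
    info = conn.execute(sql).fetchall()         # retrieved information i
    predicted = generate_fn(question, info)     # predicted answer
    p_correct = judge_fn(question, predicted, truth)
    return predicted, binary_reward(p_correct, threshold)

# Toy database and hypothetical stand-in models.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (speaker TEXT, field TEXT, value TEXT)")
conn.execute("INSERT INTO facts VALUES ('Speaker 2', 'pets', 'two cats')")

retrieve = lambda q: "SELECT value FROM facts WHERE field = 'pets'"
generate = lambda q, info: info[0][0] if info else "unknown"
judge = lambda q, pred, truth: 0.9 if pred == truth else 0.1

predicted, reward = answer_and_score(
    conn, "What pets does Speaker 2 have?", "two cats", retrieve, generate, judge
)
```

Thresholding the judge's probability, rather than sampling its output token, makes the reward a deterministic function of the judge's confidence.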
We now multiply the loss function by the reward, so that the weight update based on gradient descent is as follows:
$$\theta_{t+1} = \theta_t - \alpha R \, \nabla_{\theta} L(\theta_t),$$
where $\alpha$ is the step-size parameter, $\nabla_{\theta} L(\theta_t)$ is the gradient of the loss function as specified in Equations (3) and (4), $\theta_{t+1}$ represents the weights of the storage model at time $t+1$, and $\theta_t$ are the weights at time $t$. This turns our update into an instance of the REINFORCE algorithm [38], which ensures that the likelihood is increased only for labels that have ultimately resulted in a correct answer being generated. To address reward sparsity, when the reward is 0, we repeat this process for the same multi-session chat until the reward is 1, up to a specified maximum number of attempts.
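The reward-gated update and the retry loop described above can be sketched in a few lines. Weights and gradients are represented as plain lists of floats for illustration, `run_attempt` is a hypothetical callable standing in for one full pass of label generation and evaluation, and the step size and attempt limit are arbitrary example values.

```python
def reward_gated_step(theta, grad, reward, alpha=0.1):
    """One update: theta <- theta - alpha * R * grad (no change when R = 0)."""
    return [w - alpha * reward * g for w, g in zip(theta, grad)]

def train_on_chat(theta, run_attempt, max_attempts=3, alpha=0.1):
    """Repeat label generation for one chat until the reward is 1 or attempts run out."""
    for attempt in range(max_attempts):
        grad, reward = run_attempt(attempt)  # hypothetical: returns (gradient, R)
        if reward == 1:
            return reward_gated_step(theta, grad, reward, alpha), attempt + 1
    return theta, max_attempts               # every attempt had R = 0: weights unchanged

# Toy run: the first two attempts yield R = 0, the third yields R = 1.
rewards = [0, 0, 1]
toy_attempt = lambda k: ([0.5, -1.0], rewards[k])

theta_new, attempts_used = train_on_chat([1.0, 2.0], toy_attempt)
theta_same = reward_gated_step([1.0, 2.0], [0.5, -1.0], reward=0)
```

Since a zero reward multiplies the gradient by zero, skipping the update and retrying is equivalent to applying the update rule on every attempt.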
  2.2.3. Retrieval Subtask
In order to train $\pi_{\phi}$, we follow the same procedure as above, using the SQL commands $r$ generated by $\pi_{\bar{\phi}}$, which is a frozen copy of $\pi_{\phi}$, as the ground truth label and maximizing the likelihood of $r$ when it results in a correct answer being generated. The loss function is thus defined as:
$$L(\phi) = -\log \pi_{\phi}(r \mid q).$$
As before, we repeat the generation of $r$ when the reward is 0, up to a maximum number of attempts. This suffices to make the task reward-respecting, but since it directly optimizes against the reward, it is not yet a subtask. One weakness that we observed with the baseline retrieval model was that, in many cases where the predicted answer was incorrect, the required information was present in the SQL table (but not retrieved) and could have been retrieved if more fields had been queried. Even when instructions to retrieve a specified number of fields are included in the prompt of the retrieval model, these instructions are not reliably followed. Thus, we define our retrieval subtask as retrieving a number of fields within a desired range $[n_{\min}, n_{\max}]$. If the number of fields retrieved, $n$, falls within the range, a small bonus is added to the reward $R$ to encourage achieving the subtask of retrieving a sufficient number of fields. Table 5 shows an example of SQL commands generated by $\pi_{\phi}$ before training that omit the necessary information because they do not query a sufficient number of fields.
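The retrieval subtask reward described above can be illustrated with a minimal sketch. The bonus of 0.1, the range of 2 to 4 fields, and the `speakers` schema are hypothetical values chosen for the example; counting fields via the cursor's column description is one possible way to obtain $n$ and is an assumption of this sketch.

```python
import sqlite3

def count_retrieved_fields(conn, query):
    """Count the columns a SELECT statement returns, via the cursor description."""
    cursor = conn.execute(query)
    return len(cursor.description)

def retrieval_reward(base_reward, n_fields, n_min, n_max, bonus=0.1):
    """Add a small bonus to R when the number of retrieved fields is within range."""
    in_range = n_min <= n_fields <= n_max
    return base_reward + (bonus if in_range else 0.0)

conn = sqlite3.connect(":memory:")
# Hypothetical schema with one row per speaker.
conn.execute("CREATE TABLE speakers (name TEXT, job TEXT, pets TEXT, city TEXT)")

n_narrow = count_retrieved_fields(conn, "SELECT job FROM speakers")
n_wide = count_retrieved_fields(conn, "SELECT job, pets, city FROM speakers")

reward_narrow = retrieval_reward(1, n_narrow, n_min=2, n_max=4)
reward_wide = retrieval_reward(1, n_wide, n_min=2, n_max=4)
```

Keeping the bonus small relative to the main reward of 1 means the subtask can shape retrieval behavior without outweighing answer correctness.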