Toward Competent Robot Apprentices: Enabling Proactive Troubleshooting in Collaborative Robots

For robots to become effective apprentices and collaborators, they must exhibit some level of autonomy, for example, recognizing failures and identifying ways to address them with the aid of their human teammates. In this systems paper, we present an integrated cognitive robotic architecture for a “robot apprentice” that is capable of assessing its own performance, identifying task execution failures, communicating them to humans, and resolving them, if possible. We demonstrate the capabilities of our proposed architecture with a series of demonstrations and confirm with an online user study that people prefer our robot apprentice over robots without those capabilities.


Introduction
Our ability to perform collaborative tasks relies on an awareness of our own and our collaborators' abilities and limitations. Knowing these abilities and limitations allows us to make more effective requests, adapt to failures, modify behaviors, or generally avoid actions that will certainly fail. This is referred to as the "social mind" [1] and has been shown to provide a valuable set of problem-solving tools [2]. In this paper, we work towards providing robots with these capabilities in an effort to develop more autonomous and more effective robot collaborators.
As a motivating example, consider Figure 1. A robot can be controlled by a human using language (e.g., [3][4][5]). In these approaches, however, the robot can be viewed more as a tool: language is used to specify a series of behaviors to be executed. In contrast, we consider cases where unexpected events may occur, and so the robot must become an apprentice. A robot can, for example, observe that a grasping task has failed, but it must then work collaboratively with the human expert (who may observe the cause of the failure, or know other task requirements) to arrive at an acceptable solution.
Figure 1. Our Fetch in the FetchIt! environment (left) will be forced to address problems such as an ambiguous instruction with differing outcomes or an intentionally jammed gripper (both right), as well as a power failure and a jam that restrict action only in specific contexts (not pictured).
Recent work in robot self-assessment based on past task performance (e.g., [6]) is highly relevant in this regard, as past experience can be used to make predictions about future behavior outcomes. But merely predicting failures does not address how to overcome them or how to work with a human partner to address ongoing faults. To properly cope with these failures, the robot must be able to understand their impact. It can then use its performance self-assessment to determine the best strategy for completing the task and inform the human if the course of action deviates from what was originally instructed or expected.
In this systems paper, we present an integrated system in which autonomous robot self-assessment feeds a dialogue interface, a goal management system, and an action execution system. This produces an agent that is capable of introspecting into expected task performance and using this information for detecting, classifying, and coping with unexpected failures. The system is also capable of engaging in dialogues with human interactants about all aspects of this process, allowing it to reject instructed actions if they are certain to fail, and to propose alternative solutions that have a chance of succeeding.
To this end, our contribution is a system which, through a tight integration of a cognitive architecture and novel fault detection/communication strategies, is capable of identifying a failure and inferring its impact given past experiences, which it can then communicate to a human partner or handle proactively. We additionally contribute a series of demonstrations of our approach and a human subject study providing evidence supporting this proactive approach.
To present our approach, we first discuss previous work, framed in the context of the broader assembled system. We then provide a general implementation strategy in the form of an architecture-level overview of our technical approach, as well as the specific technical contributions that enable its key features (self-assessment, dialogue, and failure detection/classification for use in a cost-optimized planning problem). Finally, we demonstrate the capabilities of this system through case studies and a human subject study.

Robot Self-Assessment
Robot self-assessment enables robots to predict potential outcomes before, during, and after task execution, providing insight into success or failure probabilities. Examples have largely been constrained to tightly controlled domains, such as operator scheduling, which aims to enable an agent to be as autonomous as possible while also identifying when (and how) a human partner should take over, if necessary. This has been studied largely with UAVs (unmanned aerial vehicles) [7,8] and other deployed autonomous systems (such as the distributed sensing systems of [9] or the search-and-rescue platforms of [10]). Multi-agent approaches have also been explored [11], some of which utilize game theory models [12], while others use a "neglect"-based model [8,13]. Task-oriented approaches include [14,15], which focus on grasping and handovers, respectively, or [16,17], which focus on task scheduling and assessment within a human-specified task. These systems could likely be adapted to a broader array of problems than their authors have presented, but this remains unexplored.
Like these, our agent works within the confines of its instructions. However, our agent makes continuous assessments against past experience to determine statistically whether the robot's performance is still within expectation, and which pool of past experience should be referenced if it is not. For example, given the robot's current inability to grab objects, perhaps it should be making use of a pool of experience gathered while its gripper was jammed. This past experience is necessary for setting a baseline of expectation, which is useful for detecting unexpected events and for predicting the success of future actions. Further, it enables inferences beyond what the agent can immediately observe. As we show in Section 3.5, being able to select the appropriate pool of past experience enables the agent to make use of data that has not been directly observed during the current deployment.
In our approach, the agent attempts to detect that the distribution of action outcomes is statistically outside expectation, implying that there is some unexpected and unobserved error state. Systems which attempt to explore novelties have been discussed in previous work. For example, consider [18], which uses an "artificial curiosity mechanism" to explore unexpected features in the agent's environment. Closer to our work is [19], in that the system works to infer some hidden state, though we are more interested in using this inference to inform dialogues which explore solutions. In contrast to these systems, which explore their environment, ours passively monitors the environment, constructing a model of the environment which can be statistically evaluated (similarly to [20]). However, unlike [20], we do not rely on a neural model. We instead use a symbolic approach that allows us to explicitly reason about failure events and their impact, to construct long-term plans, and to generate dialogue. Purely sub-symbolic approaches, such as neural networks and other deep learning methods, struggle with these tasks, motivating our use of a symbolic approach.

Action Selection with Human Interactions
The technical developments behind anticipatory planning (sometimes also called human-aware planning) are not immediately relevant to our work. However, research in this space has explored how robots that directly anticipate human needs are perceived. In particular, this research has explored the relationship between robot decision-making and human-robot trust, team cohesion, and task completion, providing a strong theoretical justification for our approach.
Past work has demonstrated that systems which anticipate human goals are capable of effective motion and task planning [21,22], as well as cost-optimized planning (such as optimizing for time [23]). Some of these systems operate by maintaining a reasonable model of the human partner's cognitive state to anticipate human behavior before selecting an action (e.g., [24]). This approach can also be used to determine what information should be communicated when performing a task [25], or to avoid needing communication to the extent that this is possible (e.g., through inferring goals [26] or inferring when a question can be asked [27]).
These anticipatory systems have been shown to be effective for improving the efficiency of human-robot teams. For example, [28,29] demonstrated that a robot system capable of selecting actions, as informed by expected human action, leads to a "dramatic improvement" [29] in team cohesion and task completion, among other effects. Further, these systems have demonstrated their ability to integrate with dialogue systems to both interpret human commands and follow up with questions [30]. This previous literature demonstrates the value of systems which prioritize human perceptions when task planning, and motivates our interest in this work.
One strategy for determining the appropriate action to select is to construct a cost-optimized planning problem. Cost-optimized planning allows planning problems to be constructed that account for some action cost; behaviors which can be selected by the agent are associated with some value, and the cumulative value of a plan is maximized or minimized to select the best option. Approaches for assigning the action cost vary widely, including action time-to-perform [31,32], human-specified costs [33], deviance from human expectation [34], social norms [35], social welfare maximization [36], energy cost [37], and many more beyond our scope.
These costs can be learned from real-time robot experience, a realization we will make use of. This is seen in [38] and [6], in which the robot learns success likelihoods through its own experience. In [38], this knowledge is used for a later planning problem, where the agent attempts to maximize success and efficiency. In [6], it is used for human-robot dialogues about potential action successes, failures, and counterfactuals. Our system will provide both of these features.

Task and Performance Dialogues
In [39], the authors present a system that enables robot self-assessments through dialogue. Of particular interest to us is the ability of the robot R to describe, to the human H, action outcomes through probabilities which are learned through experience, as illustrated below:
H: Describe how to dance.
R: To dance, I raise my arms, I lower my arms, I look left, I look right, I look forward, I raise my arms, and I lower my arms.
H: What is the probability that you can dance?
R: The probability that I can dance is 0.9.
The robot has a complex action ("dance") which is known to contain a series of other actions.Through knowledge of the success rate of each of these actions, it can calculate the cumulative probability.
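This composition can be sketched as a simple product over learned per-action success rates. The sketch below is our own illustration, not the system's actual API; the action names and rates are assumptions:

```python
from functools import reduce

def script_success_probability(script, success_rates):
    """Multiply the learned success rate of each sub-action in an action script."""
    return reduce(lambda p, action: p * success_rates[action], script, 1.0)

# Assumed example rates for the sub-actions of the "dance" script
rates = {"raiseArms": 0.97, "lowerArms": 0.99, "lookLeft": 1.0,
         "lookRight": 1.0, "lookForward": 1.0}
dance = ["raiseArms", "lowerArms", "lookLeft", "lookRight",
         "lookForward", "raiseArms", "lowerArms"]
p = script_success_probability(dance, rates)
```

Because "raiseArms" and "lowerArms" each appear twice in the script, their rates contribute twice to the product, mirroring how repeated sub-actions compound risk.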
As further taxonomized and discussed in [40], these tight integrations between dialogue and robot self-assessment systems enable an improved human understanding of the robot's proficiency at a given task. The question of what should be said, as well as when, is non-trivial [41]. In [42], a formalism is proposed to balance taking time to complete an action against taking time to communicate with a partner. In both [43,44], communication is represented within the action and planning pipelines of the agent to enable fluent human-robot partner interactions. This fluent interaction is increasingly recognized as an important step in developing effective human-robot interactions [45].
Many human-robot dialogue approaches are human-driven, like ours. It is the human operator who initiates an interaction and drives the conversation, with the robot supplying responses, taking actions, and providing feedback (although we will show that the agent taking some degree of initiative is desirable and preferred). This approach is used in, for example, [46], where dialogues are used in collaborative tasks, or in [47], where agent exploration can provide information for a later dialogue. Such actively engaged and embodied approaches may have dramatically positive impacts on the performance of the human-robot team (as discussed in more detail in [48,49]). We will, therefore, later demonstrate that our approach contributes to effective human-robot teaming (despite it not being in quite the same category as robots that are exclusively "tools").
However, some robot performance dialogue work has been explicitly robot-driven. Consider, for example, [50], where value-of-information theory is used to determine when a robot should query an operator (and with what questions). These approaches contribute value in their ability to engage in dialogues which are triggered by some observation, and we will make use of this approach: when a problem with the human partner's plan is observed, it is the responsibility of the agent to detect it, interject, and solve it.

Environment and Tasks
We consider scenarios where a Fetch robot R receives an instruction from a human H to perform an assembly task that requires R to gather several objects and place them in the "caddy" item, a subset of the ICRA 2019 FetchIt! challenge [64] in which the Fetch mobile manipulator [51] must autonomously navigate an enclosed area to detect, grasp, and retrieve a set of objects. To complete its task, the robot has a set of perception, manipulation, and navigation actions. As described in [52], these actions can be assembled into sequences to create larger "action scripts". Available actions include approach(location), grab(object), fetch(object), and others. The agent is already aware of some task-relevant objects (the caddy, screw, small and large gears, gearbox top and bottom) as well as their locations on each table. For basic robot behaviors, we utilize the manufacturer-provided software stack.
We modify the task in different ways to induce failures, e.g., by intentionally preventing the robot's grippers from closing using a small plastic component wrapped around one of the gripper fingers (this device is visible in Figure 1, right). The agent is not aware of the potential failures (in this case, that the gripper cannot grasp), and instead begins with the assumption that it is fully operational. It must appropriately select actions and come to its own conclusions about its ability to solve tasks autonomously when possible.
To enable this behavior, we take a systems-level approach to assembling a cognitive architecture which allows the agent to perceive its environment and plan actions towards various goals. We extend goal management and logical reasoning systems that enable task planning and execution by enabling them to assess failure likelihoods, emulating "what if" reasoning: Is the task likely to succeed as is, or would a substituted plan fare better? Our architecture is presented in Figure 2 and discussed in more detail next.

Sensing
Our agent employs two sensing modalities, visual and auditory, the former to obtain symbolic representations of objects in the environment (utilizing the methods discussed in part in [52]). This representation is then stored in short-term (or long-term) memory as appropriate, providing knowledge of object types and locations to other components which may need it (as described in [54]).
Auditory perception is performed in the form of speech recognition, utilizing CMU Sphinx4 [55,56] to convert spoken words to text before parsing. For the text to be useful to the agent, it is provided to a parsing system which resolves it to actionable behavior or beliefs used by an action management system, also allowing for dialogues about action capabilities (as with [39]), analogous to the process discussed in [57].

Reasoning and Goal Management
We assume actions have preconditions and effects, which enables a classical planning problem: given a goal state and knowledge of how actions impact the world, a series of actions can be selected to achieve the goal. While plans are often selected by their length, we instead perform plan selection based upon a probability-of-success assessment: when an action or set of actions has been selected, it can be performed by various task-specific or robot-specific functions. As [6] introduces, and we have expanded, action outcomes are monitored and tracked, and so the success likelihood of an autonomously generated plan can be computed. Similarly, as we have introduced and will outline, unexpected events imply the presence of a not-yet-measured state, allowing the context to be changed if necessary.
The action execution subsystem maintains a set of executable robot behaviors which it can perform. Upon receiving a goal in the form of a first-order logical predicate (e.g., at(screw) or pickup(caddy)), the action subsystem is responsible for determining how it can be resolved. Because the goal is in the form of a first-order logical predicate, a wide variety of strategies can be employed to find a series of actions to accomplish the task. One common approach is use of the Planning Domain Definition Language (PDDL) [58], although in our specific implementation we make use of the system outlined in [52]. We modify the existing action selection strategy to select plans based not on their length, but on their cost. We define "cost" here as "success likelihood", using the strategy outlined next. In doing so, the robot selects and performs a plan based on how likely it is to succeed.
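As a minimal sketch of this selection rule (the action names and rates below are illustrative assumptions, not the architecture's actual representation), candidate plans are scored by their joint success likelihood rather than by their length:

```python
def plan_success(plan, success_rates):
    """Joint success probability of a plan, assuming independent steps."""
    p = 1.0
    for action in plan:
        p *= success_rates.get(action, 0.0)
    return p

def select_plan(candidates, success_rates):
    """Pick the candidate plan most likely to succeed, not the shortest one."""
    return max(candidates, key=lambda plan: plan_success(plan, success_rates))

# Two hypothetical plans that both achieve holding the caddy
rates = {"approach(caddy)": 1.0, "pickUp(caddy)": 0.05, "scoop(caddy)": 0.6}
plans = [["approach(caddy)", "pickUp(caddy)"],
         ["approach(caddy)", "scoop(caddy)"]]
best = select_plan(plans, rates)  # the scoop plan, despite equal length
```

With a jammed gripper driving pickUp's rate toward zero, the otherwise less reliable scoop plan wins, which is exactly the substitution behavior demonstrated later.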

Self-Assessments and Performance Analysis
The robot tracks information about all action outcomes using the technique described in [6]. In this approach, it is assumed that the architecture has the ability to observe the success or failure of an action. In some cases these failures can be directly observed by the architecture, as when software fails to execute, or a process takes too long and is canceled. In many cases, however, success must be defined through some observation of the world after a behavior is performed. For example, the pickup(caddy) action is known to lead to a state where the "caddy" object will no longer be visible on the table. If it remains on the table, we can safely conclude that pickup(caddy) has somehow failed. This performance can be recorded over a large number of robot operations to provide a statistical assessment of each robot behavior's likelihood to succeed. Further, as [6] discusses, this takes place using only typical first-order logic planning operations (such as those used by the previously discussed PDDL), and so does not introduce an additional architectural requirement.
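The postcondition-style success check described above might be sketched as follows; the world representation and effect table are our own simplified assumptions, not the architecture's:

```python
# Map each action to a predicate over the post-action world state that should
# hold if the action succeeded; pickup(caddy) should remove the caddy from the table.
EXPECTED_EFFECTS = {
    "pickup(caddy)": lambda world: "caddy" not in world["on_table"],
}

def observed_success(action, world_after):
    """Return True/False if the action's success is observable, else None."""
    check = EXPECTED_EFFECTS.get(action)
    return None if check is None else check(world_after)

world = {"on_table": {"caddy", "screw"}}          # the caddy is still on the table
result = observed_success("pickup(caddy)", world)  # False: the pickup failed
```

Actions without a registered effect predicate yield None, reflecting the paper's point that not every outcome is directly observable to the architecture.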
We, therefore, expand [6] by using likelihoods of action performance for cost-optimized task planning. We can calculate the success probability of a plan π = ⟨a_1, a_2, ..., a_n⟩ as the product of the probability of each action in the plan succeeding: P(π) = ∏_{i=1}^{n} success(a_i). Note that we assume independence of each plan step as an approximation to keep the problem tractable (it will at times not be accurate, as prior steps in a plan may have been taken to improve the likelihood of later steps, for example).
We modified the success/failure book-keeping from [6]. While they focus on updating action success probabilities based on a single action's success, we focus on a set of the most recent prior actions in the plan. We accomplish this by viewing the success of action a_i not just as a probability conditioned on its preconditions but also on the success of a "window" W of up to k action predecessors, which allows us to view the preceding sequence of actions as context C for the success or failure of a_i. For example, in a plan to grab an object, we monitor and consider the success of each action in the plan (driving to the object, moving the arm, etc.).
If we then encounter a situation where all actions in W = ⟨a_1, a_2, ..., a_{i−1}⟩ are completed successfully but a_i failed or had an unexpected outcome, without there being a context C that included this failure, then we can add a new context C = ⟨a_1, a_2, ..., a_i⟩ that consists of all the actions in W and the failed action a_i. Note that because a_i failed, the agent knows that something in the world must be different, but it might not be able to determine the cause (e.g., because it might not be observable for the agent).
For example, the closeGripper action is typically highly reliable, unless the gripper has become jammed. Thus, if the closeGripper action fails, it may be statistically reasonable to infer that a gripper jam is occurring. This inference is a fairly straightforward task of comparing the events in the current window to the distributions held within the contexts, and selecting as the current context the one which is most probable. This selection provides knowledge of other action outcome distributions, which may not have been directly observed in the current window.
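This context-selection step can be sketched as a maximum-likelihood choice over the recent window of outcomes; the contexts, action names, and rates below are illustrative assumptions rather than the deployed system's values:

```python
def window_likelihood(window, context_rates, prior=0.5):
    """Likelihood of the observed (action, succeeded) window under a context."""
    p = 1.0
    for action, succeeded in window:
        r = context_rates.get(action, prior)  # assumed prior for unseen actions
        p *= r if succeeded else (1.0 - r)
    return p

def select_context(window, contexts):
    """Pick the context that makes the observed outcomes most probable."""
    return max(contexts, key=lambda name: window_likelihood(window, contexts[name]))

contexts = {
    "fully_operational": {"goToPose": 0.99, "approach": 0.98, "closeGripper": 0.99},
    "gripper_jammed":    {"goToPose": 0.99, "approach": 0.98, "closeGripper": 0.01},
}
window = [("goToPose", True), ("approach", True), ("closeGripper", False)]
inferred = select_context(window, contexts)  # "gripper_jammed"
```

The first two outcomes leave the contexts indistinguishable; the single closeGripper failure makes the jammed-gripper context far more likely, matching the behavior described in the text.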
This knowledge is useful for future action selection: actions which were previously highly reliable but are now failing can be substituted with otherwise non-optimal behaviors. It is additionally useful for dialogues: if the contexts have been previously recorded and labeled, they can be used to generate explanations or to describe solutions.
Figure 3 shows a simplified example with two different contexts in C: C_0, which we have labeled as the fully operational context, and C_1, which we have labeled as being produced by a jammed gripper. As the agent takes the actions goToPose(prepare), approach(screw), and gripperGoTo(screw), each outcome is observed, making C_0 and C_1 indistinguishable in the current window W. (Here, goToPose(X) moves the arm to some pre-saved pose X; approach(X) drives the robot up to some landmark associated with X; and gripperGoTo(X) identifies X in the scene and moves the gripper towards it. We also use the closeGripper() action, which closes the gripper; pickUp(X), which identifies and attempts to grasp X; scoop(X), which attempts to hook and lift X; and others.) However, when the agent attempts closeGripper(), a significant difference is observed, making C_1 the more likely context, which has a high probability of pickUp(caddy) failing. As a result, the robot selects the otherwise less-performant scoop(caddy) action instead to proceed.

A Problem-Solving Example
With this fully integrated system, we can demonstrate the complex interplay between knowledge and reasoning, planning, acting, observing, and dialogue that gives rise to the following exchange:
H: What is the probability that you can fetch the gearbox top?
R: The probability is 0.

H: Pick up the caddy.
R: I don't think I can pick up, but I can scoop the caddy. [The robot scoops the caddy.]
Assuming the agent knows about the past successes of goToPose, approach, and gripperGoTo, as well as the failure of closeGripper, the NLU (natural language understanding) pipeline receives and interprets the human query about the probability of the fetch action, which is known to the agent to be composed of the approach, gripperGoTo, and closeGripper actions, among others. While the probability of many of these actions is high, the probability of closeGripper is zero; the requested action is, therefore, guaranteed to fail, which the agent communicates.
The next human instruction triggers a similar set of processes: the request is resolved to the "pickUp(caddy)" goal. While performing the action selection, however, the same performance assessment process reveals that the execution of this action is highly unlikely to succeed, which results in a search for an alternative (in general through planning), and the robot finds an action with the effect of holding the caddy, namely scoop. While this action would generally have a low chance of success and not be worth attempting, in this case it is the better of the two options, and the initial goal of pickUp is replaced with a goal to scoop the caddy, together with an explanation goal to justify the new behavior to the human.
The full sequence of this demonstration is visible in Figure 4. The gripper jam is introduced without the knowledge of the agent (c), and so the instructed grasping task fails (d). This experience is used to determine that the "scooping" action should be substituted when grasping the caddy, which the robot explains and does (e).

Case Studies
We next provide several case studies to demonstrate the capabilities resulting from the novel integration of performance assessment with fault detection, reasoning, and recovery planning. The robot and environment are the same as previously described, and the robot knows about several contexts: a fully operational case, a power failure, a large gripper jam (where no objects can be grabbed), and a small gripper jam (where some objects can be grabbed). The length of W has been made unusually short (W = 3) to make it very sensitive. This value has been chosen for brevity of interaction and demonstration.

Communicating the Current Context
In our first demonstration case, we introduce an unmitigatable fault in the form of a motor power failure. Although the agent is unable to solve this problem, it remains capable of providing a dialogue to assist with a human partner's understanding of how this fault may affect future performance. The parsing and dialogue components enable interactions between a human partner ("H") and the Fetch robot ("R"), demonstrating that the robot is capable of independent action and self-assessment:
H: Go to pose 'Prepare'. [The robot arm moves over its head.]
H: Approach the caddy. [The robot drives towards the caddy.]
H: What is the probability that you can fetch the gearbox top?
R: The probability is 0.96.
The motor power is then disconnected from the robot without its knowledge (by pressing the mechanical emergency stop ("e-stop"), motor power is disconnected, but the sensing and computation systems remain online; this fault manifests exactly as physically cutting a critical power wire would, without damaging the robot). The instructions continue, but cannot be executed:
H: What is the probability that you can approach the screw?
R: The probability is 1.0.
H: Go to pose 'Prepare'. [The robot does not move.]
H: Grab the caddy. [The robot does not move.]
The robot cannot determine the specific nature of the failure: although it can observe action failures, it lacks direct knowledge of the power failure. With additional failures, its observations deviate enough from expectations that it can infer it is in a context in which motor movements consistently fail. When a not-yet-performed action is queried, this becomes useful information for the human partner:
H: What is the probability that you can fetch the gearbox top?
R: The probability is 0.0.
Despite the failure being impossible for the robot to resolve, it has successfully identified its impact, allowing the dialogue component to communicate it to a human partner. Note also that the observed and queried actions are unrelated: the agent is not updating the existing context with this new information, it is shifting to a separate set of experience entirely. This shift provides an inference that other actions have been impacted, despite their not being directly observed.

Optimizing for Success
In the FetchIt! task, the large and small gears are perceptually similar, but because of their different sizes, the large gear is harder to pick up. This information can be queried:
H: What is the probability that you can fetch the large gear?
R: The probability is 0.77.
H: What is the probability that you can fetch the small gear?
R: The probability is 0.83.
The robot knows that fetch(largeGear) and fetch(smallGear) both satisfy pick(gear) when it receives a new instruction:
H: Pick a gear.
Despite the ambiguity, there is a clear best option. Based on its performance assessment of both, the robot selects the more appropriate action:
R: OK. [The robot grabs the small gear.]

Identifying Implications of Failure
The first failure scenario begins with a series of instructions to the robot, with the robot providing its replies:
H: What is the probability that you can approach the small gear?
R: The probability is 1.0.
H: What is the probability that you can grab the gearbox top?
R: The probability is 0.99.
H: Approach the caddy. [The robot approaches the caddy.]
H: Fetch the screw.
When the first set of interactions concludes, the agent begins a series of behaviors to complete the human-specified goal of fetching a screw for the caddy. At each behavior step, the previously outlined process of action followed by self-assessment takes place. Each step of fetching the screw meets the baseline expectation: approach(screw), searchFor(screw), and goTo(screw) are all in line with expectation. Upon attempting the grasp with the closeGripper() action, however, the custom plastic component obstructs the full closing of the grippers.
The agent is now presented with the problem of identifying and recovering from failure. We utilize the failure explanation mechanism of [57] in the dialogue component to generate an explanation:
R: I cannot fetch the screw because grasping does not grasp.
In the baseline scenario, closing the gripper is a behavior which is highly likely to succeed. Given the current window W, then, it is suddenly improbable that the current context is one of the robot operating correctly. In contrast, among the available contexts, a context where the gripper is jammed is very likely. As a result, the agent switches from the original context to the one in which the gripper is jammed, providing it with an alternative pool of experience to sample from. This new context can then be queried to provide updated information to the human partner, despite the queried event not having been experienced in the current execution:
H: What is the probability that you can approach the small gear?
R: The probability is 1.0.
H: What is the probability that you can grab the gearbox top?
R: The probability is 0.0.

Context-Dependent Action Outcomes
The prior gripper jam demonstration could feasibly be addressed using a mechanism which fully disables the grasp action, thus forcing the scoop action to be chosen for the same outcome. With this final demonstration, however, we show that this is not the ideal behavior: there exists a set of cases where the probable outcomes are disrupted within an action such that it is heavily impacted in some cases but minimally impacted in others. In such cases, an action may conditionally remain highly viable.
To demonstrate this case, we construct a new gripper jam scenario: while the gripper still cannot fully close, it can almost close, and so can still grab some larger objects. Depending on the object, then, grasp may remain the most suitable choice. We have modified this action to conclude with a visual search, allowing grasping failure detection by observing whether the object remains on the table after an attempted grasp (though importantly, this observation provides only success/failure information, not failure reasoning information). A series of instructions allows the agent to reason that it is unable to grab a specific set of objects:
H: Grab the gearbox top. [The robot grabs the gearbox top.]
H: Grab the caddy. [The robot fails to grab the caddy.]
R: I cannot grab the caddy because 'grasping' does not grasp.
These interactions inform the agent of a unique jamming case: while failure to grasp the caddy is unusual enough to demonstrate some form of failure, the observation that grasping the gearbox top succeeds provides the knowledge that this case is distinct from the previously demonstrated jamming scenario (recall that all alternative contexts have been available to the agent throughout all demonstrations, and so the prior jamming scenario could have been selected if not for this evidence). From here, we return to the "pick a gear" scenario:
H: Pick a gear. [The robot grabs the larger of the two gears.]
The Fetch appropriately chooses the larger of the two gears. Despite this gear being less likely to be grasped successfully in typical operation, the context switch provides the knowledge that the larger gear is now the more likely to succeed. Because the goal of "holding a gear" can be completed with two equally valid plans (grabbing the small or the large gear), our performance-optimizing approach enables the agent to select the strategy which is most likely to succeed. Had the more "naive" approach of disabling the grasp action been used instead, the agent would have had no alternative but to fail to complete the task. Thus, the agent has made the most of a limited set of object interactions to maximize its ability despite the ongoing failure.
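The gear choice above can be sketched as resolving an ambiguous goal against the per-object success rates of whichever context is currently selected. The rates below are invented for illustration; only the 0.83/0.77 normal-operation values echo the earlier dialogue:

```python
def best_option(options, context_rates):
    """Resolve an ambiguous goal to the option most likely to succeed in this context."""
    return max(options, key=lambda action: context_rates.get(action, 0.0))

# Hypothetical per-context rates: the partial jam blocks small objects only.
normal_ops  = {"grab(smallGear)": 0.83, "grab(largeGear)": 0.77}
partial_jam = {"grab(smallGear)": 0.02, "grab(largeGear)": 0.70}
goal_options = ["grab(smallGear)", "grab(largeGear)"]

choice_normal = best_option(goal_options, normal_ops)    # the small gear
choice_jammed = best_option(goal_options, partial_jam)   # the large gear
```

Under normal operation the small gear is preferred, but the partial-jam context flips the ordering, so the grasp action stays usable rather than being disabled outright.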

User Study
As the previous case studies demonstrate, our proposed system integration enables several important capabilities that a robot apprentice working with a human supervisor would need in order to be an effective collaborator. However, it remains to be shown that our approach is actually preferred: users interacting with a system like ours might strongly object to its ability to subvert instructions, even if doing so yields a higher chance of success. Previous iterations of the architecture (discussed in more detail in Section 2) have been limited to taking action corrections only as specifically instructed by a human partner (e.g., [59]). As a result, it is important to confirm that people interacting with a robot actually value its ability to behave as a partner and not simply as a tool.
To determine the effectiveness of our approach on human-robot collaboration, we use functional trust as an approximate metric. The impact of trust on human-robot teaming has been well established in the previous literature [60][61][62], allowing us to treat widely used measures of functional trust in an agent as a proxy for the perception of that agent as an effective partner for future tasks. While a full-scale real-world user study is beyond the scope of this work, we use an online user preference study (based on behavior recorded from the actual deployed system) to demonstrate that our system provides functionality that is preferred by potential interactants.

Methods
Participants. A total of 20 participants, all over 18 and English-speaking, took part in this study online through the Prolific system. Participants were between 19 and 40 years old (M = 29.55, SD = 6.89 years). The gender distribution for the sample was 50% male and 50% female. The ethnic distribution for the sample was: White 75%, Black or African American 20%, Asian 15%, Hispanic 5%. Compensation was USD 2.50.
Conditions. We had five within-subjects conditions showing different ways that the robot responded to failure. In each condition, participants saw the Fetch performing the "pick up a caddy" task with the "jammed gripper" failure. Each interaction began with the robot receiving an instruction from a human partner to pick up the caddy; the conditions then diverged across robots A through E (transcripts pick up after the "pick up the caddy" command).

Robot A. This was the baseline failure condition. The robot failed with no acknowledgment of the failure. This robot did not use our mechanism.

Robot C. This robot failed, acknowledged the failure, and explained why it failed. This robot used a modified version of our mechanism. The context-switching mechanism explicitly associates each context with a symbolic representation of a state, in the same form used by the explanation mechanism introduced in [57]. We can, therefore, combine these two mechanisms to provide this more verbose failure explanation. The video transcript was:

R: OK. [The robot fails to grab the caddy.]
R: I cannot pick up the caddy because my gripper is jammed.

For the ranking questions, Robot E was consistently ranked as the #1 robot for all of the questions (presented in Figure 5b). For all the data, see Appendix A. Robot D was consistently ranked #2 for all four questions, Robot C was ranked #3, Robot B was ranked #4, and Robot A, our baseline failure condition, was consistently ranked #5 for all four ranking questions.
User Responses. In their qualitative answers, participants mentioned the instantiations of our mechanism specifically when describing their positive views of Robots E, D, and C. Here, we provide examples of the feedback our robots received for each of the questions participants were asked. When describing the reasoning behind their trust ratings for Robot E, one participant said, "It problem-solved very well and found a way to pick up the caddy despite the issues". Another said, "The fact that the robot can assess what it cannot do while also providing an alternative solution to the problem makes it the most trustworthy of all that I have seen today. It is smart and adaptable". Our mechanism's use in Robot E was also cited as the reason most people ranked it in the #1 spot for our four rankings. One participant's explanation of the factors affecting their "ease of operation" rank was "How aware the robot was or how adaptive it could be. The Robot E actually proposed its own solution, which is very impressive". For proactivity, one person said "E took the initiative to find its own solution". For interaction, both Robots E and D were discussed, as in "I found that the robots who could explain why they could not complete the task, or who could understand alternative motives were the most interactive". Finally, for understanding: "Robot E was the most communicative and able to express what it was doing and how it was problem solving".
Our mechanism was also recognized in Robots D and C, and contributed to the positive ratings those robots received as well. For Robot D, one participant described how their trust ratings were determined by the fact that "Though the robot was able to successfully follow further instruction, it was not able to suggest the new idea on its own". For Robot C, one participant said that "The robot's inability to perform the simple task of picking up the caddy made me view it negatively, but when it gave a specific reason for why it couldn't pick up the caddy (the gripper being jammed), this made me view it a little more positively, because it made me feel that it might be easier to troubleshoot".
These reflections on how our mechanism positively influenced the trust and robot rankings stand in contrast to the negative responses garnered by Robots A and B, which did not use our mechanism. For example, Robot A was rated with low trust because "The robot's failure and lack of [recognition] of the failure affected my decisions". For Robot B, which did acknowledge that it failed but not why: "This robot did not ask if there was a better way for it to complete the task, it just stated that it could not complete it", and "It could respond to the fact that it couldn't do the task, but it didn't state why it couldn't which makes it less reliable". Our quantitative results show that robots using our mechanism were rated significantly more favorably than robots that did not use it, and our qualitative results show that participants recognized that this difference was achieved specifically because of our mechanism.

Discussion and Future Work
In our study, we showed that the proposed integrated system, in its various forms, resulted in participants trusting a robot more than when the robot was not controlled by our proposed system. Because trust is an important component of many aspects of human-robot interaction, we consider this a success. Additionally, we found that Robot E, the robot that utilized our system to make proactive action decisions and to avoid failures, was considered the best robot in terms of ease of operation, proactivity, interaction, and understanding. Robots D and C, which used our system to a lesser degree, were ranked second and third, respectively. This highlights how our proposed mechanisms and their integration result in increased robot preference among potential users.
It is important to note that we have not introduced additional constraints on the existing robot planning/action execution domain. We leverage components already present in the robot planning domain and make use of measurements that may not explicitly be part of the traditional domain but are otherwise easy to obtain (action success/failure). For this reason, our work remains as general as the work it builds upon and can, therefore, be applied to any other agent planning/execution domain without loss of generalizability.
Because the agent depends upon experience to inform new decisions, data must be collected prior to use. Our approach can benefit here from future advances in reinforcement learning, which faces the same data-collection problem. We can also benefit from the manner in which we model action outcomes. The knowledge that some action a on object o has an outcome is distinct from the outcome of a on another object o′, and so these are represented as two unique datapoints. However, knowledge of the similarity between o and o′ may provide a starting point for learning the unique action outcomes of the two objects.
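One way such a similarity-based prior might look, sketched under our own assumptions (the feature sets, the Jaccard similarity, and the weighting scheme are illustrative; the paper only identifies object similarity as a possible starting point):

```python
def similarity(f1, f2):
    """Jaccard overlap between two object feature sets."""
    return len(f1 & f2) / len(f1 | f2)

def prior_p_success(new_obj, known, features, weight=2.0):
    """Seed the success estimate for an unseen (action, object) pair
    from outcomes on similar, already-observed objects."""
    num, den = 1.0, 2.0  # uniform Beta(1,1) fallback when nothing is similar
    for obj, (s, f) in known.items():
        w = weight * similarity(features[new_obj], features[obj])
        num += w * s
        den += w * (s + f)
    return num / den

features = {"small_gear": {"metal", "small", "round"},
            "large_gear": {"metal", "large", "round"}}
known = {"small_gear": (1, 4)}  # grasping rarely succeeded on the small gear
p = prior_p_success("large_gear", known, features)
```

The two gears still keep separate outcome records as they are acted upon; the similar object merely biases the initial estimate below the uninformed 0.5, rather than forcing the two datapoints to be merged.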

Conclusions
We have presented a novel integrated system for a "robot apprentice" that utilizes performance assessment methods paired with introspection mechanisms to enable the statistics-based self-assessment of action outcomes, which is used for identifying task execution failures, communicating them to humans, and, if possible, autonomously resolving them. We presented task-based dialogue interactions with a human supervisor through several case studies that showed the flexibility and autonomy of the system in coping with unexpected events, as one would expect of an apprentice. Furthermore, an online user study showed a strong human preference for the capabilities we enabled in our integrated system compared to variations of the system that lacked some of these features. We believe that the proposed system is an important next step toward developing versatile collaborative robot systems that enable collaborative interactions going beyond humans micromanaging the robot.
Author Contributions: Christopher Thierauf: conceptualization, implementation, writing, and a portion of the user study; Theresa Law: conducted the user study and wrote the analysis; Tyler Frasca: conceptualization and some implementation; Matthias Scheutz: conceptualization, writing, and advising. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement:
The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board (or Ethics Committee) of Tufts University (SBER).

Figure 4. The Fetch prefers the more reliable grasping strategy to grab the caddy (a), and its actions while approaching a new task are in line with expectations (b).

R: OK. [The robot fails to grab the caddy.]

Robot B. This robot failed but acknowledged the failure afterwards. It does not use our mechanism. The video transcript was:

R: OK. [The robot fails to grab the caddy.]
R: I cannot pick up the caddy.
Figure 5. (a) Histogram of average trust scores for each robot condition with standard-error bars; Robot E scores highest in trust across all conditions. (b) Counts of which robot was ranked in the #1 spot for each of the four ranking questions; Robot E (far right) is consistently ranked #1 by a large margin.