Article
Peer-Review Record

An AGI Modifying Its Utility Function in Violation of the Strong Orthogonality Thesis

Philosophies 2020, 5(4), 40; https://doi.org/10.3390/philosophies5040040
by James D. Miller 1,*, Roman Yampolskiy 2 and Olle Häggström 3
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 25 September 2020 / Revised: 10 November 2020 / Accepted: 20 November 2020 / Published: 1 December 2020
(This article belongs to the Special Issue The Perils of Artificial Intelligence)

Round 1

Reviewer 1 Report

This paper differentiates between the strong orthogonality thesis (in which most goals are compatible with most environments, for all intelligence levels) and the weak orthogonality thesis (in which most goals have environments which they are compatible with, for all intelligence levels). It argues that the strong orthogonality thesis is false, due to highly competitive environments in which agents face incentives to modify their goals.

The orthogonality thesis is of interest primarily due to arguments for and against AI risk. Many arguments for AI risk rely on a version of the orthogonality thesis, and some arguments against AI risk attack the orthogonality thesis. This paper may therefore be of interest to those discussions. However, it is my belief that the counterexamples discussed here have little or no weight in arguments against AI risk. I reason as follows:

If a highly competitive galactic environment leads to convergence of utility functions, and if the negotiating parties involved are at all successful in strategically modifying themselves to promote their original utility functions, then the final convergent utility function will include components which benefit each major player. It therefore matters a great deal what kind of "representative" humanity sends to this negotiation. If humanity creates a misaligned AGI, such as a paperclip maximizer, humanity essentially loses its seat at the negotiation table. Therefore, one would still expect a loss of the cosmic endowment.

Similar arguments can be made for a highly competitive environment on Earth (a multi-polar scenario). Each competing AGI is a seat at the bargaining table, and thus the alignment or misalignment of each agent can influence a significant portion of the cosmic endowment going forward. These multi-AGI scenarios may have other advantages and disadvantages (which have been discussed elsewhere), but the importance of aligning each superintelligence as much as possible remains.

I see nothing in the paper indicating that the authors misunderstand this point. However, my concern is that readers may easily misunderstand. This paper could be cited as a counterexample to orthogonality, in arguments against AI risk.

These are complicated issues, and it is easy for intelligent readers to miss such points.

I would therefore strongly suggest the authors attempt to address these points more fully and earlier on in the paper. I would especially suggest that the authors attempt to re-write the title and the abstract to minimize the chances of such misunderstandings. The authors may have better ideas of how to do this than I do. My current suggestion for the title is to use the term "weak orthogonality thesis" in place of "orthogonality thesis", signalling to the reader that the paper will differentiate between different forms of the orthogonality thesis.

As for the abstract, I would recommend adding a sentence explicitly addressing the relevance to AI risk (e.g., "We conclude that this does not weaken arguments about the risks of AGI").

With respect to the opening story, I would personally emphasize that although the relative measure of paperclip maximization has dropped to close to zero, the absolute amount of resources "wasted" on paperclips can be quite large. So even in this optimistic scenario, there can be disutility of cosmic proportions due to the initial misalignment.

As an intuition pump, note that each self-modification of the original agent away from goal X would potentially be coupled with another agent modifying toward goal X, meaning that in some sense the total amount of caring-for-X is conserved. (This is of course not likely to be literally true.)

(As an aside, I think the story makes it sound too easy to achieve this kind of result. No mention is made of special measures which the programmers took to ensure transparency and prevent foom scenarios where the AGI becomes too powerful to care about these incentives. Why can't the AI mislead the humans rather than really modifying its utility function?)

So, to summarize, I would suggest making at least a few small changes to indicate that this does not call into question most arguments about AI risk.

The authors may, of course, disagree with me on this point, in which case I would encourage them to make some explicit indication of how they see their argument relating to AI risk.

Another concern I have with the overall thrust of the paper is that not enough attention is paid to the case where a highly competitive environment results in non-convergence rather than convergence. The paper emphasizes that there are a great many reasons why an AGI could be incentivized to modify its utility function, including some highly counterintuitive scenarios such as changing its utility function in order to make future adversaries help it when they try to hurt it. Why should all of these utility function modifications converge to one utility function? They could instead diversify utility functions, or cause utility functions to keep changing forever, or at least not result in full convergence. (It could even break the assumption that agents are convergently rational, since agents might self-modify their behavior to include many special cases and harm overall coherence.)

With respect to this concern, the section "Convergence to the Same Utility Function" does indicate a number of ways the convergence idea could fail. On the other hand, little reason is given for the convergence thesis. Yet the paper mentions convergence throughout, rather than non-convergence.

The paper uses language like "might converge" throughout, to hedge against possible non-convergence. But I would argue that this does not do enough to call attention to the possibility. The paper would be clearer if it instead made explicit some assumptions under which hypercompetitiveness leads to convergence (e.g., the assumption that strategic utility modification has a general tendency of pulling utility functions closer together), and then stated that if this thesis were true, then hypercompetitive environments would be a counterexample to the strong orthogonality thesis.

A third general concern I have has to do with alternate decision theories, for example updateless decision theory, timeless decision theory, and logical decision theory. The paper explicitly sets these ideas aside. But the thesis of the paper rests critically on this choice. Updateless decision theory, in particular, can (in some cases at least) get through these situations without self-modifying at all. It engages in the same behavior other decision theories would need to self-modify to achieve, but it does so without actually needing to self-modify. (This is, of course, the whole point of updateless decision theory.) Therefore, the (strong) orthogonality thesis is simply more true for updateless decision theory. Furthermore, recognizing this, agents would likely prefer to self-modify to updateless decision theory rather than change their utility functions. So this possibility undermines the whole argument (at least potentially).

However, these issues are complex, and I understand the desire to avoid them here.

== Three Conditions for Utility Function Self-Modification ==

It is not clear from the writing alone whether the conditions are intended to be conjunctive, disjunctive, or otherwise. The first condition is an if-and-only-if in itself; it covers precisely the cases in which an AGI should modify its utility function. But the other two conditions aren't even jointly sufficient (I would at least add the additional assumption that the goals already differ, i.e., that the AGI is misaligned from the perspective of the other agents). So the three points look to me like a disorganized list of relevant factors rather than a clear set of criteria.

== Pascal's Mugging ==

While interesting, this just seems like a special case of blackmail to me. If the AGI is self-modifying to not give in to threats (as discussed in the blackmail section), this is already covered (probably -- as you mention, defining "threats" is an issue). I'm not saying you have to remove it, but if you have any difficulty with space constraints during revision, this section seems especially expendable to me.

== Self Modification That Makes Past Version Worse Off ==

I just want to remark that this section makes an excellent point.

== AGIs in hyper-competitive environments & Convergence to the Same Utility Function ==

I think the argument here is broadly true, particularly when understood as a possibility rather than a certainty. However, one thing about the way the argument is framed is very confusing in my opinion.

The whole argument is predicated on AGIs in a "competitive" environment. Competitive appears to be construed as meaning anti-cooperative; e.g., it is asserted that AGIs will not be able to coordinate to forestall this dynamic, because if some of them did, they could be undercut by those outside the agreement.

Yet the conclusion of the argument is highly cooperative. If all agents converge to the same utility function, then they would be able to forestall the race-to-the-bottom dynamics which the argument is predicated on.

Therefore one might conclude that the argument is self-defeating: as agents converge to a single utility function, they become more able to coordinate to stop the race-to-the-bottom described.

I do not think this is the correct conclusion, but it seems to follow from the framing of the argument as presented.

I also want to re-emphasize the "conservation of preferences" idea that I briefly mentioned earlier. If all utility function modification took the form of several agents avoiding clashes with each other by averaging their utility functions together, as in some of the examples, then in some sense the total caring for each goal is conserved. Although any individual agent ends up with almost none of its original values, those values are not broadly lost. This is somewhat contrary to the tone of the conclusion presented here. Of course, not all utility modification will take that exact form, so it is hard to say whether this kind of conservation law will approximately hold. However, it's an important question to call attention to, in my opinion.
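
As a minimal worked illustration of this conservation idea (the two-agent set-up and the equal 0.5 weights are assumptions introduced here for illustration, not taken from the paper): suppose agents A and B start with utility functions U_A and U_B and, to avoid clashing, each self-modifies to the average

    U_A' = U_B' = 0.5 U_A + 0.5 U_B.

Each agent now assigns only weight 0.5 to its original goal, but summed across both agents the total weight on U_A is still 0.5 + 0.5 = 1, and likewise for U_B. In that sense the aggregate caring for each goal is conserved even though no single agent retains its original values.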

Author Response

We now refer to the “strong orthogonality thesis” rather than just the “orthogonality thesis” multiple times in the paper, including in the title. We changed the motivating fictional story to better connect the paper with AI safety concerns. We made the change to the abstract that you suggested. We added a section on coalitions and conservation of preferences. We stressed that non-convergence is a possibility in a highly competitive environment. We did not include considerations of updateless decision theory.

In the Three Conditions for Utility Function Self-Modification section, we clarified that the first condition is necessary and sufficient but will likely only hold if the other two conditions hold. We merged the Pascal’s Mugging section into the blackmail section. We explained how the AGIs converging to the same utility function could potentially make cooperating easier or harder.

Reviewer 2 Report

This paper analyzes the possibility and the consequences of an AGI changing its utility function, especially in relation to the Orthogonality Thesis.

I think the paper is very clear in its presentation; points are well presented and explained (with perhaps the only exception of the paragraph at lines 414-421, which I had a bit of a hard time following). I think the first part is well structured: motivating example, background, main theses (change of utility functions, conditions under which this happens). The second part could perhaps be improved with a little more structuring, as it may feel like a listing of cases.

Content-wise, I have some doubts about the main thesis of the paper, though perhaps I am mis-applying RL concepts. One of the assumptions behind the work is that an AGI may decide to modify its utility function if doing so would allow it to achieve a better outcome. This is stated in the first condition for utility self-modification (line 217): "The AGI expects that as measured by its old utility function it will do better if it changes its utility function". I find this statement somewhat contradictory. Let L be the original loss function, and let \pi* be the optimal policy/behavior for L. Now replace L with a new loss L', which induces a new optimal policy \pi**. By the requirement above, the change should happen only if, "as measured by its old utility function it will do better if it changes its utility function", that is, only if \pi** outperforms \pi* as measured by L. If so, why did the agent not simply adopt policy \pi** while still retaining L? To me it sounds like there is a conflation of utility/loss and behaviour/policy, but I would appreciate the authors correcting me if I am wrong.
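
To spell out the apparent tension formally (the value notation V_L below is introduced here purely for illustration and appears neither in the paper nor in this report): write V_L(\pi) for the expected performance of policy \pi as measured by the original loss L, \pi^* = argmax_\pi V_L(\pi) for the policy optimal under L, and \pi^{**} = argmax_\pi V_{L'}(\pi) for the policy optimal under the new loss L'. Condition 1 then seems to require

    V_L(\pi^{**}) > V_L(\pi^*),

which contradicts the optimality of \pi^* under L unless the act of adopting L' itself changes the environment, for instance because other agents can observe the AGI's utility function and respond differently once it has changed.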

Similarly, I think it might help to formalize the difference between final goals and instrumental goals. In the motivating story (line 27+), why does promoting the programmers' aims get inserted into the final goal (with weight 0.5) rather than simply remaining an instrumental goal?

I also think the requirement that the utility function be observable is a little hard to accept. Once the AGI can edit its utility function, seeing its utility function at time t may not help predict its action at time t+1, since the utility function may already have changed. I think it makes more sense to ask for the possibility of inferring the utility function (as the authors do when discussing leakage, line 275+), although inference, too, requires some stationarity.

In the end, I am not very convinced of the challenge posed to the Orthogonality Thesis (as anticipated in line 210). In the conclusion, it again seems to me that AGIs would converge in behaviour/policy, not necessarily in utility function. The last statement, that the utility function would be a function of intelligence, is appealing, but I am not totally persuaded that what is a function of intelligence is the utility function rather than the policy.

Author Response

We rewrote the paragraph that you had pointed out was unclear.  We expanded the section on observable utility functions in a way that hopefully addresses your concerns.  We addressed your concerns based on RL concepts in the section Three Conditions for Utility Function Self-Modification.  We formally defined instrumental drives in the initial story.

Reviewer 3 Report

I have read Bostrom's Superintelligence book and have engaged in discussion of superintelligence, but I am not particularly knowledgeable about this field in general.

That said, I found the paper interesting and publishable. It dives deeply into the conditions under which an artificial general intelligence (AGI) might alter its own utility function. This was discussed by Omohundro, but this paper goes into many more details and ramifications. I have only one suggestion:

Line 356. How could an AGI change its utility function to not give in to Pascal’s mugging threats? This is asserted, but not explained.

Author Response

We explained how an AGI could change its utility function to resist a Pascal’s mugging.

Round 2

Reviewer 2 Report

I thank the authors for addressing my concerns openly. I find the reworded paragraph (now lines 508-512) clearer, and the discussion on observability (lines 460+) more complete.

I appreciate the authors addressing my point on policy and utility explicitly in the text. I recognize that the case they discuss in lines 288-292 (gun + neuroscanner) seems to work for their purposes, although I doubt whether it is right to claim any generality from it. I guess that in the Motivating Fictional Story the equivalent of the neuroscanner would be the reading of the utility function (and, if so, I would mark the parallel). It would be an interesting question in itself how feasible or reliable these readings (neuroscanner, introspection, reverse engineering) could or would be.

The definition and example in the Introduction ("Instrumental drives are desires to accomplish intermediate goals that will help you better fulfill your final goals, e.g. capturing your opponent’s queen in chess so you have a better chance of eventually checkmating your opponent’s king") still sounds vague to me. In particular, it does not help in deciding whether promoting the programmers' aims should be included as (part of) the final goal rather than as an instrumental one. I think dealing more rigorously with this point would add clarity, but I see it as not too relevant to the central argument itself.
