by Brian Tomasik
First published: 2 Sep. 2017; last update: 21 Feb. 2018


Goal preservation is the idea that an agent or civilization might eventually prevent goal drift over time, except perhaps in cases where its current goals approve of goal changes. While consequentialist agents have strong incentive to work toward goal preservation, implementing it in non-trivial, and especially in chaotic, systems seems very difficult. It's unclear to me how likely a future superintelligent civilization is to ultimately preserve its goals. Even if it does so, there may be significant goal drift between the values of present-day humans and the ultimate goals that a future advanced civilization locks in.


Values drift a lot over time. For example, what humans value is quite different from what our fish ancestors valued. Even in the last few centuries, values have changed significantly. Medieval European peasants might be horrified if they saw modern sexual promiscuity, irreligiosity, gender equality, tolerance of non-whites, and so on. The early Christians would probably consider many liberal forms of contemporary Christianity heretical.

Will this trend for values to mutate over time continue indefinitely? Or will superior intelligence and the possibility of centralized control of the world allow for locking in a fixed set of values for the long term? This question is very important because it affects whether present-day efforts to influence the values of the far future can be expected to have predictable lasting impacts or whether values that we try to spread now will mutate to something different within thousands or millions of years.

Why goal preservation may happen

Goal preservation is one of the "Basic AI drives". Omohundro (2008): "Imagine a book loving agent whose utility function was changed by an arsonist to cause the agent to enjoy burning books. Its future self not only wouldn’t work to collect and preserve books, but would actively go about destroying them. This kind of outcome has such a negative utility that systems will go to great lengths to protect their utility functions." In other words, we should expect rational agents of the future to place high value on implementing goal preservation.

Rational agents of the present already aim to preserve certain goals, such as through marriage vows (Lamb 2006) or corporate bylaws and mission statements. However, humans can't make exact copies of themselves, nor revert themselves to previous brain states, which partly explains why we've seen such societal goal drift over time in spite of institutions aiming to preserve traditional values. We might think that goal preservation will become easier once minds are copyable and goals can be more precisely and losslessly coded in digital form.

Another, more abstract argument could be that goal preservation is a stable state, while goal drift is not. Once you robustly achieve goal preservation, you'll stay there indefinitely unless a significant external force knocks you out of that equilibrium. We could think of goal stability as being like a hole for a golf ball. Eventually, after a golf ball has randomly moved all over the place for a long time, it might fall into a hole and then stay there—until an external force knocks it out of the hole or the golf course is destroyed. (Of course, it's possible the golf ball will never fall into the hole or will be easily knocked out of the hole.)
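As a rough illustration of the golf-ball analogy, here is a minimal sketch that treats the system as a two-state Markov chain. The model and the names `preserved_probability`, `enter`, and `exit_` are my own assumptions for illustration, not something taken from the sources cited here. It shows how the long-run chance of sitting in the "hole" depends on how easily the ball is knocked back out.

```python
def preserved_probability(steps, enter, exit_):
    """Probability of being in the 'preserved' state after `steps` periods.

    Hypothetical two-state Markov chain that starts in the 'drifting' state:
      drifting  -> preserved with probability `enter` per period
      preserved -> drifting  with probability `exit_` per period
    """
    p = 0.0  # probability of currently being in the 'preserved' state
    for _ in range(steps):
        p = p * (1.0 - exit_) + (1.0 - p) * enter
    return p


# Robust lock-in: once the ball is in the hole, it is never knocked out.
print(preserved_probability(1000, enter=0.01, exit_=0.0))

# Fragile lock-in: the ball is knocked out as easily as it falls in.
print(preserved_probability(1000, enter=0.01, exit_=0.01))

# The ball never finds the hole at all.
print(preserved_probability(1000, enter=0.0, exit_=0.0))
```

Under these assumptions, the long-run probability tends toward `enter / (enter + exit_)` whenever that ratio is defined, so preservation dominates in the long run only if leaving the preserved state is much rarer than entering it.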

Why goal preservation looks hard

Distributed goals

If artificial general intelligence (AGI) takes the form of a single, unified agent with a crisply specified utility function, then goal preservation may be feasible, since the representation of the agent's goals is relatively clear. If all future iterations of the AGI are motivated to optimize this utility function, goal preservation would be achieved.

However, it's plausible that future AGI will take the form of a "society" of parts interacting in complex ways, with individual agents making their own optimizations but without any global utility function. This vision seems sensible because it's the only kind of world we've seen so far, both in biological ecosystems and in human economies. Even individual human brains lack coherent utility functions because brains are collections of competing impulses and modules whose relative strengths wax and wane in sometimes unpredictable ways over time. Hanson (2017): "In larger complex systems, it becomes harder to isolate small parts that encode 'values'; a great many diverse parts end up influencing what such systems do in any given situation."

Maybe the individual agents of the future will cooperate and form a world government, but the values of this government would continue to evolve over time as the underlying composition of society evolved. In other words, if society's "goals" are an emergent result of complex underlying interactions, it seems difficult to constrain these goals into a fixed state.

Maybe it would be possible for the world government to decide upon a crisp utility function and forcibly update all agents within society to share the exact same values? Totalitarian societies of the past have tried to approximate this idea with moderate, though fleeting, success. Permanent conformity to a central government's rules appears more likely in a digital future than it was in the past, because rebellious impulses, if any, could perhaps be edited out of an agent's code and because constant surveillance by rebellion-detection programs seems potentially cost-effective. With exceptions like malware or severe software bugs, humans often succeed in achieving complete obedience on the part of their present-day computing systems, although erratic behavior may become more common the more software's goals are learned rather than hard-coded. In the comments on Hanson (2017), Paul Christiano explains that "AI [...] avoids agency costs because it was designed by the principal to specification", thereby avoiding most principal-agent problems that we face when working with other humans who have their own goals that we can't directly edit.

Easier goal editing

While human values have drifted a fair amount over time, there has also been some stability on core principles, as we can see from some cultural universals. Presumably part of the reason for these universals is that humans have relatively similar brains to one another. These brains evolve slowly and aren't capable of massive architectural changes all at once.

With computer systems, goals are much more easily edited, and a computer's goals and architecture can be radically transformed a lot more quickly. So in the absence of vigorous efforts to ensure goal preservation, we might expect the advent of AGI actually to speed up goal drift significantly. If there are evolutionary pressures to change goals or cognitive architectures, then goal drift should happen relatively quickly.

Drift during self-improvement

Goal preservation also looks difficult in the face of self-improvement. In Waser (2014), Eliezer Yudkowsky writes:

The argument for AIs that prove changes correct is not that the total risk goes to zero – there are lots of other risks – the argument is that an AI which *doesn’t* prove changes correct is *guaranteed* to go wrong in a billion sequential self-modifications because it has a conditionally independent failure probability each time. An AI that proves changes correct has a *non-zero* chance of *actually working* if you get everything else right; the argument is that non-change-proving AIs are effectively *guaranteed to fail*.

But proving AGI updates correct seems extremely difficult to pull off, especially given that the most plausible path to AGI at present looks like hacky, connectionist, often uninterpretable learning architectures. Maybe future AGIs will eventually figure out how to formally prove the goal alignment of their successors, but I suspect, as does Waser (2014), that there may be a fundamental tradeoff between a system's amenability to proofs and its potential for "unbounded learning". The difficulty for a less intelligent system to verify that it approves of a more intelligent one is part of what the Machine Intelligence Research Institute calls the problem of Vingean Reflection.

Especially in a multipolar AGI scenario, AGIs may face pressure to improve themselves quickly, even if this reduces the reliability of goal preservation (Oesterheld 2016).
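The compounding-failure arithmetic in the Yudkowsky quote above can be made concrete with a short sketch. The per-modification failure probabilities below are purely illustrative assumptions, and the function names are hypothetical. The only thing taken from the quote is the premise that each of a billion sequential self-modifications fails independently.

```python
import math


def survival_probability(per_step_failure, steps):
    """Chance that every one of `steps` independent modifications preserves goals."""
    # (1 - p)**n, computed in log space so tiny p and huge n stay accurate.
    return math.exp(steps * math.log1p(-per_step_failure))


def max_failure_rate(target_survival, steps):
    """Largest per-step failure probability that still meets `target_survival`."""
    # Solve (1 - p)**n = target_survival for p.
    return -math.expm1(math.log(target_survival) / steps)


n = 10**9  # "a billion sequential self-modifications"
for p in (1e-6, 1e-9, 1e-12):  # assumed per-step failure probabilities
    print(f"p = {p:.0e}: survival = {survival_probability(p, n):.6f}")

print(f"p needed for 99% survival: {max_failure_rate(0.99, n):.2e}")
```

On this toy model, survival over a billion steps stays likely only if the per-step failure probability is far below one in a billion. That is one way to read the claim that unverified modifications are "effectively guaranteed to fail", though it says nothing about whether proofs of correctness are actually achievable.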

Human in the loop?

Maybe one solution for preserving human values would be to always keep a human in the loop and ask the human to approve any big actions taken. Giving an AGI the precise motivation to keep a human around and consult her in this way sounds difficult, and even if it could be pulled off, it's not clear the human would be able to understand everything well enough to form sensible opinions about what should be done. Human voters already have difficulty understanding the complexities of society well enough to judge accurately which political outcomes they prefer. Having a human mind or an ensemble of human minds make judgments about desired policies in a galaxy-wide network of extremely fast and detailed digital computations that know vastly more about science and philosophy than any human possibly could sounds like a challenge.


Perhaps there are ways to cleanly implement robust goal preservation. Even if so, putting in place such measures seems hard to do, although sufficiently consequentialist agents will have strong incentive to devise and effectuate such measures. All told, it seems very unclear how likely long-lasting goal preservation is. I lean toward thinking it's pretty hard, but maybe there are clever solutions to make it work.

It does seem plausible to me that even if goal preservation will happen eventually, there will be a lot of goal drift between now and when it ultimately happens, perhaps so much goal drift that future values will be almost completely alien by the standards of present-day humans. I think assuming a lot of future goal drift is consistent with a generalized Copernican principle, since it would be very surprising if, out of all forms of life that have existed on Earth for billions of years, we humans in the 21st century happened to be the special ones whose values got preserved for billions of years to come.

Hanson on value drift

Hanson (2018) discusses value drift and concludes: "Someday we may be able to coordinate to overrule the universe on [the tendency for values to drift]. But I doubt we are close enough to even consider that today. [...] For now value drift seems one of those possibly lamentable facts of life that we cannot change." Regarding AI, Hanson (2018) says:

It may be possible to create competitive AIs with protected values, i.e., so that parts where values are coded are small, modular, redundantly stored, and insulated from changes to the rest of the system. If so, such AIs may suffer much less from internal drift and cultural drift [than humans do]. Even so, the values of AIs with protected values should still drift due to influence drift and competition.

Crystallization of values based on capital ownership?

Christiano (2014) proposes an argument for why the distribution of the world's values may not drift as much in the future. Partly the argument is based on the possibility that machines will be able to take on the values of their owners to a greater degree than children or employees take on the values of parents or bosses. And partly the argument is that more of the world's wealth creation will take the form of returns to investment, rather than wages paid to other people whose values current wealth holders don't control. Christiano's post has many good comments, including some by Ben Kuhn and Robin Hanson, which I won't recapitulate here. I'll mention two of my own replies, which are not disjoint from the replies of others. It's also worth noting that Christiano himself doesn't dogmatically believe in his proposal; he says of his argument: "I don’t [think] this claim is widely regarded as obvious in the broader world, and I don’t think it’s too likely to be true."

1. Randomness in investment winners and losers

A simplistic reading of Christiano's model is that it assumes that, in his words, "someone who starts with x% of the resources can maintain x% of the resources as the world grows." But this is often not the case, because different people will invest in different things, some of which will do better than others. For example, some people will invest in Google, while others will invest in companies that fare worse. Niel Bowerman raises a similar point in a comment:

It seems to me that unless there was significantly more risk management done in the post-AI world than there is done among well-regarded investors today, then investments would have a range of possible outcomes. Thus we would see some investments having significantly higher payoffs than others, resulting in individuals wealths following trended random-walk-like behaviours. So while the expectation for each person is that they would maintain a constant share of total wealth, the global distribution of wealth would not be static and we would not see a “crystallisation of influence”.

Christiano replies: "That seems right–the argument here only implies 'static in expectation.' The actual share of influence could vary considerably."
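The "static in expectation" point can be illustrated with a toy calculation. This is a minimal sketch under my own assumptions, not a model from Christiano or Bowerman: two hypothetical investors start with equal wealth, and each round each investor's wealth is independently multiplied by `up` or `down` with equal probability. The code computes the exact distribution of one investor's share rather than sampling it.

```python
from math import comb, sqrt


def share_distribution(periods, up=1.5, down=0.7):
    """Exact distribution of investor A's wealth share after `periods` rounds.

    Returns a list of (share_of_A, probability) pairs for the toy model in
    which A and B start equal and each round independently multiply their
    wealth by `up` or `down` with probability 1/2.
    """
    outcomes = []
    total_paths = 4 ** periods  # 2**periods equally likely paths per investor
    for a_ups in range(periods + 1):
        wealth_a = up ** a_ups * down ** (periods - a_ups)
        for b_ups in range(periods + 1):
            wealth_b = up ** b_ups * down ** (periods - b_ups)
            prob = comb(periods, a_ups) * comb(periods, b_ups) / total_paths
            outcomes.append((wealth_a / (wealth_a + wealth_b), prob))
    return outcomes


def mean_and_std(outcomes):
    """Mean and standard deviation of a discrete (value, probability) list."""
    mean = sum(share * prob for share, prob in outcomes)
    variance = sum((share - mean) ** 2 * prob for share, prob in outcomes)
    return mean, sqrt(variance)


for periods in (1, 10, 100):
    mean, std = mean_and_std(share_distribution(periods))
    print(f"{periods:>3} rounds: expected share = {mean:.3f}, std dev = {std:.3f}")
```

By symmetry, the expected share stays at one half no matter how many rounds pass, while the spread of realized shares widens over time. In this toy setting, that is the gap between a share that is static in expectation and a distribution of influence that is actually static.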

In a similar spirit, Ben Kuhn mentions in a comment that exogenous shocks, such as the two World Wars, could destroy capital very unequally. Christiano replies: "If you wiped out 50% of the capitalists at random, it wouldn’t really matter: the expected share of the world owned by each capitalist is not affected." But shocks like world wars don't wipe out capitalists at random. Values tend to be clustered geographically, and so are the losses during war. Indeed, the losses that Germany sustained during World War II were partly because of the Third Reich's atrocious values.

Maybe one could argue that investors could just invest in the entire world market through index funds, such that whether Google or some other company wins, and whether Germany or the USA is hit hard during war, different investors will still get equal returns. But unless all investors use the same index-fund approach, some investors who happen to invest more aggressively in the winning companies/countries will do better, and others will do worse. Also, some investments are hard to do through an index-fund approach, such as angel investing in very early-stage companies.

Depending on one's context and network, some individuals will inevitably be "more invested" (in various ways) in some projects than others. This is true even informally. For example, people who happened to be close college friends with the person who ends up becoming president of the USA have more power than those who weren't friends with her.

2. System upheavals

In its simplistic form, Christiano's model seems to assume that the infrastructure of capital ownership will continue without major system disruptions for a very long time. For example, property rights will continue to be enforced, the equivalent of a stock market will continue to function, and so on. But it's not clear that this is likely to happen. Historically, almost every seemingly robust political institution that was put in place was eventually destroyed, from ancient Egyptian dynasties to the Soviet Union. While economics drives a significant share of world history, other forces come into play as well, often in unpredictable ways.

Christiano mentions redistribution as one of the reasons that the values of capital owners don't solidify today: "if I were to be making 1% of gross world product in rents, they would probably be aggressively taxed or otherwise confiscated and redistributed more equitably. [...] a large, long-lived investment fund still seems quite likely to be dismantled for the good of the people alive today, one way or another." He goes on to argue that the situation would plausibly change with machine intelligences: "If machine intelligences secure equal [political] representation, and if 1% of machine intelligences share my values, then there is no particular reason to expect redistribution or other political maneuvering to reduce the prevalence of my values." I agree that the expected influence of your values won't necessarily change, but it seems like the actual distribution of values will change, perhaps radically. Redistribution occurs when those who want to redistribute have the power to do so. In a future of machine intelligences, if there are coalitions that want to redistribute wealth and power to other coalitions, they will do so if they can—just like what happens today.

Christiano agrees that conflict could complicate things: "Conflict has another interesting implication, which is that e.g. even if 10% of people shared values X, you might end up with a predictable coordinated effort by everyone else to stamp out value X, which would totally mess with the model (and wouldn’t be changed by the arrival of machine intelligence)." Isn't wealth redistribution in the present world just an example of this? Even if the "top 1%" of people value their own wealth, "a predictable coordinated effort by everyone else" to take some wealth away from those people can lead to less wealth for the top 1%.

"Balance of power" theory in international relations gives other examples where shifting coalitions of actors can exert decidedly non-random influences on the distribution of power in a system of competing values. For example, if a dominant power looks like it might become a global hegemon, other actors may band together to fight against such hegemony. In this case, it's not random whether the dominant power increases or decreases its share of influence; its share of influence is more likely to decrease because it's being targeted on account of its size. Of course, the opposite might be true: perhaps a leading power grows stronger as other countries decide that "if you can't beat 'em, join 'em". This again would be a non-random change in the distribution of power. A priori it's difficult to predict exactly how such dynamics would play out, but it seems implausible to me that competing machine intelligences of the future won't display non-random trends like these.

Perhaps one could argue that if the rule of law breaks down in the future, then progress on machine intelligence will halt, so those future scenarios are irrelevant to predictions about machine intelligence. But while profit-motivated AGI development by Google might indeed slow or cease in the event of a political revolution or world war, military-motivated AGI progress might continue, perhaps at a faster pace than ever. Revolutions throughout history have toppled a ruling class and redistributed its wealth without returning those societies to the Stone Age. For example, following the French Revolution, Napoleon controlled a sizable portion of Europe, and following the Russian Revolution, the Soviet Union became a world superpower.

Another argument might be that even if the rule of law breaks down, as long as I own x% of machine intelligence, I'll have x% of influence in the future anarchy. But this is only true in expectation, and the actual distribution of power is likely to shift over time.

Christiano acknowledges the possibility of a revolution in values: "the transition to machine intelligences may also be an opportunity for influence to shift considerably—perhaps in large part to machines with alien values." I would add that unless robust cooperation is feasible and can be developed quickly after the emergence of smarter-than-human intelligence, then competition, warfare, and "political revolutions" may continue even after alien machine values have taken over the world.

Individual goal drift vs. distributional shift

My replies to Christiano (2014) have argued that the distribution of values is unlikely to be static in a multipolar machine-controlled future. However, if we assume that the individual values that are vying for control are fixed, then one can argue that, at least as best as we can tell ex ante, the expected amount of control for a given value is roughly proportional to its share of capital, wealth, power, etc. in the present or near future. Thus, efforts to promote some values over others can still have significant expected impacts on the far future even if the exact future trajectory is very unpredictable.

More troubling to the idea of having clear expected impacts on the far future is the possibility that the individual values that are competing could themselves mutate—whether because of evolutionary pressures, imperfect goal alignment during self-improvement, or just random change over time. If most of the values that exist today are completely gone in 1000 years and replaced by new values, then a simple "more power now implies more expected power later" argument seems misguided.