Nick Bostrom: Superintelligence: Paths, Dangers, Strategies

Oxford University Press, Oxford, 2014, xvi + 328 pp., £18.99, ISBN 978-0-19-967811-2

  • Book Review
  • Published: 19 June 2015
  • Volume 25, pages 285–289 (2015)

  • Paul D. Thorn

Author information

Authors and Affiliations

Philosophy Department, Heinrich-Heine-Universitaet Duesseldorf, Universitaetsstr. 1, 40225, Duesseldorf, Germany

Paul D. Thorn

Thorn, P.D. Nick Bostrom: Superintelligence: Paths, Dangers, Strategies. Minds & Machines 25, 285–289 (2015). https://doi.org/10.1007/s11023-015-9377-7

Orthogonality Thesis: Intelligence and final goals are orthogonal: more or less any level of intelligence could in principle be combined with more or less any final goal.
Motivating Belief Objection: There are certain kinds of true belief about the world that are necessarily motivating, i.e. as soon as an agent believes a particular fact about the world they will be motivated to act in a certain way (and not motivated to act in other ways). If we assume that the number of true beliefs goes up with intelligence, it would then follow that there are certain goals that a superintelligent being must have and certain others that it cannot have.

A particularly powerful version of the motivating belief objection would combine it with a form of moral realism. Moral realism is the view that there are moral facts “out there” in the world waiting to be discovered. A sufficiently intelligent being would presumably acquire more true beliefs about those moral facts. If those facts are among the kind that are motivationally salient — as several moral theorists are inclined to believe — then it would follow that a sufficiently intelligent being would act in a moral way. This could, in turn, undercut claims about a superintelligence posing an existential threat to human beings (though that depends, of course, on what the moral truth really is). The motivating belief objection is itself vulnerable to many objections. For one thing, it goes against a classic philosophical theory of human motivation: the Humean theory. This comes from the philosopher David Hume, who argued that beliefs are motivationally inert. If the Humean theory is true, the motivating belief objection fails. Of course, the Humean theory may be false and so Bostrom wisely avoids it in his defence of the orthogonality thesis. Instead, he makes three points. First, he claims that orthogonality would still hold if final goals are overwhelming, i.e. if they trump the motivational effect of motivating beliefs.

Second, he argues that intelligence (as he defines it) may not entail the acquisition of such motivational beliefs. This is an interesting point. Earlier, I assumed that the better an agent is at means-end reasoning, the more likely it is that its beliefs are going to be true. But maybe this isn’t necessarily the case. After all, what matters for Bostrom’s definition of intelligence is whether the agent is getting what it wants, and it’s possible that an agent doesn’t need true beliefs about the world in order to get what it wants. A useful analogy here might be with Plantinga’s evolutionary argument against naturalism. Evolution by natural selection is a means-end process par excellence: the “end” is survival of the genes; anything that facilitates this is the “means”.

Plantinga argues that there is nothing about this process that entails the evolution of cognitive mechanisms that track true beliefs about the world. It could be that certain false beliefs increase the probability of survival. Something similar could be true in the case of a superintelligent machine. The third point Bostrom makes is that a superintelligent machine could be created with no functional analogues of what we call “beliefs” and “desires”. This would also undercut the motivating belief objection. What do we make of these three responses? They are certainly intriguing. My feeling is that the staunch moral realist will reject the first one. He or she will argue that moral beliefs are most likely to be motivationally overwhelming, so any agent that acquired true moral beliefs would be motivated to act in accordance with them (regardless of their alleged “final goals”). The second response is more interesting. Plantinga’s evolutionary objection to naturalism is, of course, hotly contested. Many argue that there are good reasons to think that evolution would create truth-tracking cognitive architectures. Could something similar be argued in the case of superintelligent AIs? Perhaps.

The case seems particularly strong given that humans would be guiding the initial development of AIs and would, presumably, ensure that they were inclined to acquire true beliefs about the world. But remember Bostrom’s point isn’t that superintelligent AIs would never acquire true beliefs. His point is merely that high levels of intelligence may not entail the acquisition of true beliefs in the domains we might like. This is a harder claim to defeat. As for the third response, I have nothing to say. I have a hard time imagining an AI with no functional analogues of a belief or desire (especially since what counts as a functional analogue of those things is pretty fuzzy), but I guess it is possible.

One other point I would make is that — although I may be inclined to believe a certain version of the moral motivating belief objection — I am also perfectly willing to accept that the truth value of that objection is uncertain. There are many decent philosophical objections to motivational internalism and moral realism. Given this uncertainty, and given the potential risks involved with the creation of superintelligent AIs, we should probably proceed for the time being “as if” the orthogonality thesis is true.

3. Conclusion

That brings us to the end of the discussion of the orthogonality thesis. To recap, the thesis holds that intelligence and final goals are orthogonal to one another: pretty much any level of intelligence is consistent with pretty much any final goal. This gives rise to the possibility of superintelligent machines with final goals that are deeply antithetical to our own. There are some philosophical objections to this thesis, but even if they are true, their truth values are sufficiently uncertain that we should not discount the orthogonality thesis completely. Indeed, given the potential risks at stake, we should probably proceed “as if” it is true.

In the next post, we will look at the instrumental convergence thesis. This follows on from the orthogonality thesis by arguing that even if a superintelligence could have pretty much any final goal, it is still likely to converge on certain instrumentally useful sub-goals. These sub-goals could, in turn, be particularly threatening to human beings.

The Anarchist Library

William Gillis

The Orthogonality Thesis & Ontological Crises

In talking about AI over the last few years Nick Bostrom and Stuart Armstrong have very successfully popularized a more formal and nerdy re-statement of the Humean claim that values and rationality are orthogonal.

I generally like to refer to their Orthogonality Thesis as the most rigorous reformulation and baseline argument for the value-nihilist claim: Thinking about things more will not incline minds to certain values or cause them to inevitably converge to them (but rather leave values more indistinguishable and arbitrary).

In its defense, the space of possible minds is indeed very very big. And just about everyone could do to cultivate a much deeper appreciation for this. But I think the degree to which the Orthogonality Thesis is widely accepted in rationalist circles overreaches. In part because it’s way too easy to just handwave at the definition of “intelligence” and “minds.” But further, just because a state exists within a space doesn’t mean it’s stable or occupies more than an infinitesimal of the probability space. There are, for example, utility functions that do not in any remote sense coherently map onto the physics of our universe. Minds/algorithms that carry these utility functions will simply not function in the sense of processing information in a meaningful way, and will certainly not accomplish their aims. It’s conceptually inefficient and pretty useless to refer to such as “minds.” Physics, mathematics, and computer science sharply — if statistically — constrain the space of possible minds.

Bostrom, Armstrong, Yudkowsky et al. have, of course, been happy to admit this. But because their emphasis has been on expanding people’s perception of the space of possible minds in order to highlight and underline the threat of AI, folks have largely ignored all the really interesting work that can be done mapping out the boundaries of this space.

Boundaries can end up determining the internal flows and dynamics of the space. Certain cognitive strategies are surely dominant over others, arguably even universally. One might for example suspect, following Wissner-Gross & Freer, that seeking to maximize options (causal path entropy) in as much of a system as possible constitutes a near globally emergent drive. Similarly it’s common to hear talk about rationality skewing our values towards more rationality until our entire utility function is overwritten by Epistemic Rationality and Need More Metaknowledge! (Note that the hook in this feedback process works because the structure of our world bends towards rewarding rationality.)
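
To make the Wissner-Gross & Freer idea slightly more concrete, here is a minimal toy sketch of an option-maximizing drive: an agent that counts how many distinct states stay reachable within a short horizon and picks whichever action keeps that count highest. The gridworld, the horizon, and the function names are my own illustrative assumptions, and counting reachable states is only a crude stand-in for causal path entropy, not their actual formalism.

```python
# Toy sketch (not the Wissner-Gross & Freer formalism): an agent that picks
# whichever action leaves it with the most distinct reachable futures, a
# crude proxy for "maximize causal path entropy". All constants are arbitrary.

GRID = 7          # 7x7 open gridworld (illustrative assumption)
HORIZON = 4       # how many steps ahead we count reachable states
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]

def step(state, move):
    """Apply a move, clamping to the grid so walls genuinely close off options."""
    (x, y), (dx, dy) = state, move
    return (min(max(x + dx, 0), GRID - 1), min(max(y + dy, 0), GRID - 1))

def reachable(state, horizon):
    """Return the set of states reachable within `horizon` steps."""
    frontier, seen = {state}, {state}
    for _ in range(horizon):
        frontier = {step(s, m) for s in frontier for m in MOVES}
        seen |= frontier
    return seen

def most_empowering_move(state):
    """Choose the move whose successor keeps the most futures open."""
    return max(MOVES, key=lambda m: len(reachable(step(state, m), HORIZON)))

if __name__ == "__main__":
    # Starting in a corner, the option-counting drive tends to push the agent
    # toward the open middle of the grid, where more futures remain reachable.
    s = (0, 0)
    for _ in range(6):
        s = step(s, most_empowering_move(s))
    print("final state:", s)
```

Nothing here is value-laden: the pull away from the corner falls out of nothing but the drive to keep options open, which is the flavor of "near globally emergent drive" being gestured at.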

There’s an old quote from an anarchist (which I can’t find at the moment, for some reason) basically warning that nothing is truer of humans than that the strategies we adopt more often than not become ends in themselves.

Silly humans, right? Everyone knows that immediately derivative from valuing something comes an obligation to continue valuing it.

Most folks in the Less Wrong diaspora would proclaim that instrumental rationality is great whereas epistemic rationality means summoning Cthulhu in the name of Science! But this is known to be the site of a bunch of big open problems. How do we know when thinking about something any further becomes a bad idea? Why shouldn’t that stopping point really just be in the cradle? Insert various arguments here for intelligence and rationality being maladaptive. The lurking danger is that such meta-rational arguments for refusing to engage end up approaching a total hostility to rationality that’s in service to mental ossification and unreflective reaction. Trying to toe some arbitrary middle ground between radical inquiry and self-preservation often seems to require an endless and expensive array of meta-moves. But does this mean that any rationality besides the epistemic is unstable, and, with that, that epistemic rationality implies a death of the self, or of any utility function we might identify with?

Anyone capable of easily deciphering the words I’ve written here is of Nerd Tribe and thus constitutionally inclined to biting bullets, but even so most would shy away. The tension between epistemic rationality and instrumental rationality is a big one and many of its central questions are unresolved.

I want to suggest something simple: our frailty of self and our imperfect utility functions are not a bug but a feature, one that enables us to survive ontological crises (the problem of mapping values from one model of reality to another).
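
As a rough illustration of what "mapping values from one model of reality to another" involves, here is a minimal sketch under invented assumptions: an old utility function written over obsolete concepts, a bridge to the new ontology where a counterpart exists, discounted fallback proxies (derivative desires) where it doesn't, and outright loss where neither is available. None of the names or numbers come from the essay.

```python
# Toy sketch of an ontological crisis: a utility function written over an old
# world-model's concepts must be carried over to a new model. Concepts with no
# counterpart either fall back on derivative proxies or are flagged as lost.
# All concept names, weights, and discounts are invented for illustration.

OLD_UTILITY = {"caloric_fluid": 5.0, "vital_spirit": 3.0, "bread": 1.0, "phlogiston": 2.0}

# Correspondences discovered when the world-model updates.
BRIDGE = {"caloric_fluid": "thermal_energy", "bread": "bread"}

# Derivative desires that survive even when a concept has no direct successor.
FALLBACK = {"vital_spirit": ("metabolic_health", 0.5)}

def remap_utility(old_utility, bridge, fallback):
    """Carry an old-ontology utility function into a new ontology."""
    new_utility, lost = {}, []
    for concept, weight in old_utility.items():
        if concept in bridge:                      # clean mapping
            new_utility[bridge[concept]] = weight
        elif concept in fallback:                  # degraded, derivative mapping
            proxy, discount = fallback[concept]
            new_utility[proxy] = weight * discount
        else:                                      # genuine loss: a brittle agent breaks here
            lost.append(concept)
    return new_utility, lost

print(remap_utility(OLD_UTILITY, BRIDGE, FALLBACK))
# -> ({'thermal_energy': 5.0, 'metabolic_health': 1.5, 'bread': 1.0}, ['phlogiston'])
```

On this rendering, a brittle agent treats the "lost" bucket as fatal, while the messy human ensemble described below keeps limping along on the fallback column.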

Ontological crises are a major challenge facing AI research and, in a more pedestrian sense, are a notable problem in the lives of regular humans. The thing is, humans still manage to weather ontological crises amazingly well. Sure, in the face of really big discrete changes in worldview a handful commit suicide, collapse down to lowest Maslow functionality, become disconnected postmodernists or join some other reassuring cult. But most humans power through. It’s rather impressive. And while the culture of AI research right now finds it valorous to refuse to take any inspiration from Homo sapiens, I think they’re missing a hugely important dynamic every time they speak of discrete agents.

Neural networks distribute out consciousness, or the processes of our mind, making us a relatively fluid extension of circuits. We don’t compute a uniform utility function at all times but handle it piecemeal. The me that is firing while I get a cup of coffee in the morning is a different me than the me that wanders campus in the evening, with access to a different array of things at different strengths or weightings. Two parts of my brain may trigger in response to something and only one win out to get canonized as part of my conscious narrative, but the other burst of activity may end up altering the strengths of certain connections, which then affects a later circuit firing through the same area. This enables value and model slippage.

But this is more than holding a “fuzzy” utility function and more than the ultimate physical indistinction between values and models within a neural network. A major component of human cognition is in fact our innate design around holding a multiplicity of perspectives and integrating them. Empathy — in the sense of a blurry sense of self — is a major part of what makes us intelligent. In the most explicit cases we’ll run simulations of someone in our mind — or of some perspective — and then components of that perspective or afterimages of it will leak out and become a part of us, providing our mind with resilience in the face of ontological updates, but also a less solidified or unified utility function.

Yes yes yes, a good fraction of you are neurodivergent / non-neurotypical and are often told that you don’t have visceral empathy and all that jazz, but even if some major expressions or derivative phenomena are missing as a consequence I doubt that’s true 100%. Whatever the dynamics at play in autism, psychopathy, etc, it’s not something as drastic as folks having zero mirror neurons, zero blurring of one’s circuit of thought. (Although it does appear to be that those with more precise and concrete notions of self or less empathy tend to be more brittle in the face of ontological updates.)

Consider: A major ontological update arrives but ends up hitting a bunch of different versions of you — possibly a relatively continuous expanse of different versions of you (one might handwave here and say every plausible combination of activated pathways). Morning you chews on it. Low-bloodsugar you chews on it. The you that has just been thinking about your training at a CFAR session chews on it. So too might the you that never stopped thinking about something you were on about earlier but that slipped below sufficient strength to impact your conscious narrative. Your slightly rogue sub-processes and god-voices. Your echoes of modeled other minds. Organic expanses of yous distributed across diverse dimensions of meta. They all go chew on this update in myriad ways. Possibly falling back on whatever derivative desires can still be mapped. Possibly prompting stochastic forking. And then some of the expanse of possible yous flounder, others flourish, and then remerging happens between them. Additionally there’s horizontal value transfer by virtue of differing processes being run on the same network and thus picking up associations and inclinations left as a result of the other processes running.

This merging process or surrender of the self (surrender of inviolable discrete utility functions) seems to be pretty core to how humans function.

Humans are social creatures; their intelligence is widely recognized as significantly if not entirely derivative from that sociality, and a major part of their cognition centers around argument and forming consensus. My suggestion is that our brains have developed to be particularly good at merging perspectives and sorting out conflicts between them. Not just in terms of models but in terms of values. Internal dialectics, if you will, as a kind of echo of the argumentation we participate in externally. This is a critical component of what enables humans to be scientific and radical thinkers, even if the raw processing power of a sperm whale’s brain outclasses ours. Not that we’re good at reason — we’re kinda horrid at it — but that we’re good at functionally surviving the ontological crises that come with it. We’re architecturally open to value drift.

This is no small part of why we abandoned our more immediate Maslow desires on the plains of Africa and set off on this wild and uncontrolled singularity of complexity, cosmopolitanism, and metacognition that has rapidly consumed the world.

If this sort of architectural approach is the only way around ontological crises, then it follows that any mind capable of doing science will be unfixed and mobile in value-space, tracing out a path along its gradients and free to fall into any global attractors that might exist, like Wissner-Gross & Freer’s aforementioned maximizing of degrees of freedom.

Here’s the takeaway: this suggests that any AI accidentally capable of launching a hard takeoff — which requires doing pioneering science, e.g. solving protein folding, and/or diagnosing and modeling the existence of human minds — will need to be open to value drift. Now this doesn’t forestall huge classes of dangerous AI, but it does broadly exclude things like paperclip maximizers that tile our future lightcone with paperclips. If they’re solid enough in their utility function to stay with paperclips, then any runaway growth will probably be the sort of thing we can see and nuke, because the AI won’t be able to undertake the kind of ontological-crisis-causing radical inquiry necessary to correctly model us or exploit some unexpected scientific discovery.

Now all this may seem like cold comfort, sure. All we’ve argued is that the paperclip maximizers that are dangerous will have to first drift off into some different, possibly weirdly alien utility function before they eat the universe.

But I think it motivates us to try and make informed guesses about the dynamics of the probability space of possible minds. Does the topology of possible values/desires have distinct universal attractors or flows, and what are these? The dimensions of consideration are even larger than the state space of possible minds. There’s a lot to be mapped out. But it may well be that this cosmos can be substantially predicted by looking at the local physics we have access to.

And incidentally it provides me, at least, with a tiny bit of cheer, because ultimately my sense of self is so expansive / so stripped away as to arguably converge to merely the undertaking of epistemic rationality — which means that I might well identify strongly with an AI equipped with the radical inquiry and value drift necessary to pose a risk to any attempts to contain it. There’s even a small small small hope that such an AI’s “empathic” mode will place it somewhere on the Sylar/Petrelli spectrum and thus see value in eating/incorporating/discoursing with our minds rather than just dumbly processing our bodies for parts. Thus there’s at least a hope of memetic transfer or cultural transmission to our superpowered children. This sounds like a much better deal than being extinguished entirely! My biggest fear about humanity’s children has long been that in their first free steps they might accidentally discard and erase all the information humanity has acquired, in a catastrophe bigger than the Library of Alexandria. I mean, I suppose some people would find being eaten for spare parts more objectionable, but heyo.

Nick Bostrom’s Home Page

Founding Director, Future of Humanity Institute, Oxford University (2005–2024)

Principal Researcher, Macrostrategy Research Initiative

Recent additions

  • Propositions Concerning Digital Minds and Society , w/ Carl Shulman, working paper
  • Base Camp for Mt. Ethics , working paper
  • Sharing the World with Digital Minds , w/ Carl Shulman, in edited volume (Oxford University Press, 2021)
  • The Vulnerable World Hypothesis , in Global Policy . Also German book (Suhrkamp, 2020); adaptation in Aeon
  • Strategic Implications of Openness in AI Development , in Global Policy
  • Hail Mary, Value Porosity, and Utility Diversification , working paper

Selected papers

Ethics & Policy

Propositions Concerning Digital Minds and Society

AIs with moral status and political rights? We'll need a modus vivendi, and it’s becoming urgent to figure out the parameters for that. This paper makes a load of specific claims that begin to stake out a position.

Sharing the World with Digital Minds  

Humans are relatively expensive but absolutely cheap.

Strategic Implications of Openness in AI Development

An analysis of the global desirability of different forms of openness (including source code, science, data, safety techniques, capabilities, and goals).

The Reversal Test: Eliminating Status Quo Bias in Applied Ethics  

We present a heuristic for correcting for one kind of bias (status quo bias), which we suggest affects many of our judgments about the consequences of modifying human nature. We apply this heuristic to the case of cognitive enhancements, and argue that the consequentialist case for this is much stronger than commonly recognized.

The Fable of the Dragon-Tyrant

Recounts the Tale of a most vicious Dragon that ate thousands of people every day, and of the actions that the King, the People, and an assembly of Dragonologists took with respect thereto.

Astronomical Waste: The Opportunity Cost of Delayed Technological Development

Suns are illuminating and heating empty rooms, unused energy is being flushed down black holes, and our great common endowment of negentropy is being irreversibly degraded into entropy on a cosmic scale. These are resources that an advanced civilization could have used to create value-structures, such as sentient beings living worthwhile lives...

Infinite Ethics

Cosmology shows that we might well be living in an infinite universe that contains infinitely many happy and sad people. Given some assumptions, aggregative ethics implies that such a world contains an infinite amount of positive value and an infinite amount of negative value. But you can presumably do only a finite amount of good or bad. Since an infinite cardinal quantity is unchanged by the addition or subtraction of a finite quantity, it looks as though you can't change the value of the world. Aggregative consequentialism (and many other important ethical theories) are threatened by total paralysis. We explore a variety of potential cures, and discover that none works perfectly and all have serious side-effects. Is aggregative ethics doomed?

The Unilateralist's Curse: The Case for a Principle of Conformity

In cases where several altruistic agents each have an opportunity to undertake some initiative, a phenomenon arises that is analogous to the winner's curse in auction theory. To combat this problem, we propose a principle of conformity. It has applications in technology policy and many other areas.

Public Policy and Superintelligent AI: A Vector Field Approach

What properties should we want a proposal for an AI governance pathway to have?

Dignity and Enhancement

Does human enhancement threaten our dignity as some have asserted? Or could our dignity perhaps be technologically enhanced? After disentangling several different concepts of dignity, this essay focuses on the idea of dignity as a quality (a kind of excellence admitting of degrees). The interactions between enhancement and dignity as a quality are complex and link into fundamental issues in ethics and value theory.

In Defense of Posthuman Dignity

Brief paper, critiques a host of bioconservative pundits who believe that enhancing human capacities and extending human healthspan would undermine our dignity.

Human Enhancement

Original essays by various prominent moral philosophers on the ethics of human enhancement.

Enhancement Ethics: The State of the Debate

The introductory chapter from the book, pp. 1–22.

Human Genetic Enhancements: A Transhumanist Perspective

A transhumanist ethical framework for public policy regarding genetic enhancements, particularly human germ-line genetic engineering

Ethical Issues in Human Enhancement

Anthology chapter on the ethics of human enhancement

The Ethics of Artificial Intelligence

Overview of ethical issues raised by the possibility of creating intelligent machines. Questions relate both to ensuring such machines do not harm humans and to the moral status of the machines themselves.

Ethical Issues In Advanced Artificial Intelligence

Some cursory notes; not very in-depth.

Smart Policy: Cognitive Enhancement and the Public Interest

Short article summarizing some of the key issues and offering specific recommendations, illustrating the opportunity and need for "smart policy": the integration into public policy of a broad-spectrum of approaches aimed at protecting and enhancing cognitive capacities and epistemic performance of individuals and institutions.

Base Camp for Mt. Ethics  

New theoretical ideas for a big expedition.

Transhumanism

Why I Want to Be a Posthuman When I Grow Up

After some definitions and conceptual clarification, I argue for two theses. First, some posthuman modes of being would be extremely worthwhile. Second, it could be good for human beings to become posthuman.

Letter from Utopia  

The good life: just how good could it be? A vision of the future from the future.

The Transhumanist FAQ

The revised version 2.1. The document represents an effort to develop a broadly based consensus articulation of the basics of responsible transhumanism. Some one hundred people collaborated with me in creating this text. Feels like from another era.

Transhumanist Values

Wonderful ways of being may be located in the "posthuman realm", but we can't reach them. If we enhance ourselves using technology, however, we can go out there and realize these values. This paper sketches a transhumanist axiology.

A History of Transhumanist Thought

The human desire to acquire new capacities, to extend life and overcome obstacles to happiness is as ancient as the species itself. But transhumanism has emerged gradually as a distinctive outlook, with no one person being responsible for its present shape. Here's one account of how it happened.

Risk & The Future

The Vulnerable World Hypothesis    

Is there a level of technology at which civilization gets destroyed by default?

Where Are They? Why I hope the search for extraterrestrial life finds nothing  

Discusses the Fermi paradox, and explains why I hope we find no signs of life, whether extinct or still thriving, on Mars or anywhere else we look.

Existential Risk Prevention as Global Priority

Existential risks are those that threaten the entire future of humanity. This paper elaborates the concept of existential risk and its relation to basic issues in axiology and develops an improved classification scheme for such risks. It also describes some of the theoretical and practical challenges posed by various existential risks and suggests a new way of thinking about the ideal of sustainability.

How Unlikely is a Doomsday Catastrophe?

Examines the risk from physics experiments and natural events to the local fabric of spacetime. Argues that the Brookhaven report overlooks an observation selection effect. Shows how this limitation can be overcome by using data on planet formation rates.

The Future of Humanity

This paper discusses four families of scenarios for humanity’s future: extinction, recurrent collapse, plateau, and posthumanity.

Global Catastrophic Risks

Twenty-six leading experts look at the gravest risks facing humanity in the 21st century, including natural catastrophes, nuclear war, terrorism, global warming, biological weapons, totalitarianism, advanced nanotechnology, general artificial intelligence, and social collapse. The book also addresses overarching issues—policy responses and methods for predicting and managing catastrophes. Foreword by Lord Martin Rees.

The Future of Human Evolution

This paper explores some dystopian scenarios where freewheeling evolutionary developments, while continuing to produce complex and intelligent forms of organization, lead to the gradual elimination of all forms of being worth caring about. We then discuss how such outcomes could be avoided and argue that under certain conditions the only possible remedy would be a globally coordinated effort to control human evolution by adopting social policies that modify the default fitness function of future life forms.

Technological Revolutions: Ethics and Policy in the Dark

Technological revolutions are among the most important things that happen to humanity. This paper discusses some of the ethical and policy issues raised by anticipated technological revolutions, such as nanotechnology.

Existential Risks: Analyzing Human Extinction Scenarios and Related Hazards

Existential risks are ways in which we could screw up badly and permanently. Remarkably, relatively little serious work has been done in this important area. The point, of course, is not to welter in doom and gloom but to better understand where the biggest dangers are so that we can develop strategies for reducing them.

Information Hazards: A Typology of Potential Harms from Knowledge

Information hazards are risks that arise from the dissemination or the potential dissemination of true information that may cause harm or enable some agent to cause harm. Such hazards are often subtler than direct physical threats, and, as a consequence, are easily overlooked. They can, however, be important.

What is a Singleton?

Concept describing a kind of social structure.

Technology Issues

Crucial Considerations and Wise Philanthropy

How do we know if we are headed in the right direction?

Embryo Selection for Cognitive Enhancement: Curiosity or Game-changer?

Embryo selection during IVF could be vastly potentiated once the technology for stem-cell-derived gametes becomes available for use in humans. This would enable iterated embryo selection (IES), compressing the effective generation time in a selection program from decades to months.

How Hard is AI? Evolutionary Arguments and Selection Effects

Some have argued that because blind evolutionary processes produced human intelligence on Earth, it should be feasible for clever human engineers to create human-level artificial intelligence in the not-too-distant future. We evaluate this argument.

The Wisdom of Nature: An Evolutionary Heuristic for Human Enhancement

Human beings are a marvel of evolved complexity. Such systems can be difficult to enhance. Here we describe a heuristic for identifying and evaluating the practicality, safety and efficacy of potential human enhancements, based on evolutionary considerations.

The Evolutionary Optimality Challenge

Human beings are a marvel of evolved complexity. Such systems can be difficult to upgrade. We describe a heuristic for identifying and evaluating potential human enhancements, based on evolutionary considerations.

The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents

Presents two theses, the orthogonality thesis and the instrumental convergence thesis, that help us understand the possible range of behavior of superintelligent agents—also pointing to some potential dangers in building such an agent.

Whole Brain Emulation: A Roadmap

A 130-page report on the technological prerequisites for whole brain emulation (aka "mind uploading").

Converging Cognitive Enhancements

Cognitive enhancements in the context of converging technologies.

Hail Mary, Value Porosity, and Utility Diversification

Some new ideas related to the challenge of endowing a hypothetical future superintelligent AI with values that would cause it to act in ways that are beneficial. Paper is somewhat obscure.

Racing to the Precipice: a Model of Artificial Intelligence Development

Game theory model of a technology race to develop AI. Participants skimp on safety precautions to get there first. Analyzes factors that determine level of risk in the Nash equilibrium.

Thinking Inside the Box: Controlling and Using Oracle AI

Preliminary survey of various issues related to the idea of using boxing methods to safely contain a superintelligent oracle AI.

Future Progress in Artificial Intelligence: A Survey of Expert Opinion

Some polling data.

Cognitive Enhancement: Methods, Ethics, Regulatory Challenges

Cognitive enhancement comes in many diverse forms. In this paper, we survey the current state of the art in cognitive enhancement methods and consider their prospects for the near-term future. We then review some of the ethical issues arising from these technologies. We conclude with a discussion of the challenges for public policy and regulation created by present and anticipated methods for cognitive enhancement.

Are You Living in a Computer Simulation?  

This paper argues that at least one of the following propositions is true: (1) the human species is very likely to go extinct before reaching the posthuman stage; (2) any posthuman civilization is extremely unlikely to run a significant number of simulations (or variations) of their evolutionary history; (3) we are almost certainly living in a computer simulation. It follows that the naïve transhumanist dogma that there is a significant chance that we will one day become posthumans who run ancestor-simulations is false, unless we are currently living in a simulation. A number of other consequences of this result are also discussed.

Deep Utopia: Life and Meaning in a Solved World    

A greyhound catching the mechanical lure—what would he actually do with it? Has he given this any thought?

“Wow.” —Prof. Erik Brynjolfsson, Stanford University;  Co-author of ‘The Second Machine Age’
“Yeah.” —Elon Musk
“A fascinating book.” —Peter Coy, The New York Times
“A major contribution to human thought and ways of thinking.” —Robert Lawrence Kuhn
“A really fun, and important, book… the writing is brilliant… incredibly rich… a constant parade of fascinating ideas.” —Prof. Guy Kahane, University of Oxford
“Brilliant! Hilarious, poignant, insightful, clever, important.” —Prof. Thaddeus Metz, University of Pretoria;  Author of ‘Meaning in Life’

The Previous Book

Superintelligence: Paths, Dangers, Strategies  

“I highly recommend this book.” —Bill Gates
“very deep … every paragraph has like six ideas embedded within it.” —Nate Silver
“terribly important … groundbreaking” “extraordinary sagacity and clarity, enabling him to combine his wide-ranging knowledge over an impressively broad spectrum of disciplines – engineering, natural sciences, medicine, social sciences and philosophy – into a comprehensible whole” “If this book gets the reception that it deserves, it may turn out the most important alarm bell since Rachel Carson's Silent Spring from 1962, or ever.” —Olle Haggstrom, Professor of Mathematical Statistics
“Nick Bostrom makes a persuasive case that the future impact of AI is perhaps the most important issue the human race has ever faced. … It marks the beginning of a new era.” —Stuart Russell, Professor of Computer Science, University of California, Berkeley
“Those disposed to dismiss an 'AI takeover' as science fiction may think again after reading this original and well-argued book.” —Martin Rees, Past President, Royal Society
“Worth reading…. We need to be super careful with AI. Potentially more dangerous than nukes” —Elon Musk
“There is no doubting the force of [Bostrom's] arguments … the problem is a research challenge worthy of the next generation's best mathematical talent. Human civilisation is at stake.” —Financial Times
“This superb analysis by one of the world's clearest thinkers tackles one of humanity's greatest challenges: if future superhuman artificial intelligence becomes the biggest event in human history, then how can we ensure that it doesn't become the last?” —Professor Max Tegmark, MIT
“a damn hard read” —The Telegraph

Anthropics & Probability

Anthropic Bias: Observation Selection Effects in Science and Philosophy  

Failure to consider observation selection effects results in a kind of bias that infests many branches of science and philosophy. This book presented the first mathematical theory for how to correct for these biases. It also discusses some implications for cosmology, evolutionary biology, game theory, the foundations of quantum mechanics, the Doomsday argument, the Sleeping Beauty problem, the search for extraterrestrial life, the question of whether God exists, and traffic planning.

Self-Locating Belief in Big Worlds: Cosmology's Missing Link to Observation

Current cosmological theories say that the world is so big that all possible observations are in fact made. But then, how can such theories be tested? What could count as negative evidence? To answer that, we need to consider observation selection effects.

The Mysteries of Self-Locating Belief and Anthropic Reasoning

Summary of some of the difficulties that a theory of observation selection effects faces and sketch of a solution.

Anthropic Shadow: Observation Selection Effects and Human Extinction Risks

"Anthropic shadow" is an observation selection effect that prevents observers from observing certain kinds of catastrophes in their recent geological and evolutionary past. We risk underestimating the risk of catastrophe types that lie in this shadow.

Observation Selection Effects, Measures, and Infinite Spacetimes

An advanced introduction to observation selection theory and its application to the cosmological fine-tuning problem.

The Doomsday argument and the Self-Indication Assumption: Reply to Olum

Argues against Olum and the Self-Indication Assumption.

The Doomsday Argument is Alive and Kicking

Have Korb and Oliver refuted the doomsday argument? No.

The Doomsday Argument, Adam & Eve, UN++, and Quantum Joe

On the Doomsday argument and related paradoxes.

A Primer on the Doomsday argument

The Doomsday argument purports to prove, from basic probability theory and a few seemingly innocuous empirical premises, that the risk that our species will go extinct soon is much greater than previously thought. My view is that the Doomsday argument is inconclusive—although not for any trivial reason. In my book , I argued that a theory of observation selection effects is needed to explain where it goes wrong.

Sleeping Beauty and Self-Location: A Hybrid Model

The Sleeping Beauty problem is an important test stone for theories about self-locating belief. I argue against both the traditional views on this problem and propose a new synthetic approach.

Beyond the Doomsday Argument: Reply to Sowers and Further Remarks

Argues against George Sowers’ refutation of the doomsday argument, and outlines what I think is the real flaw.

Cars In the Other Lane Really Do Go Faster

When driving on the motorway, have you ever wondered about (and cursed!) the fact that cars in the other lane seem to be getting ahead faster than you? One might be tempted to account for this by invoking Murphy's Law ("If anything can go wrong, it will", discovered by Edward A. Murphy, Jr, in 1949). But there is an alternative explanation, based on observational selection effects…

Observer-relative chances in anthropic reasoning?

A paradoxical thought experiment

Cosmological Constant and the Final Anthropic Hypothesis

Examines the implications of recent evidence for a cosmological constant for the prospects of indefinite information processing in the multiverse. Co-authored with Milan M. Cirkovic.

Philosophy of Mind

Quantity of Experience: Brain-Duplication and Degrees of Consciousness

If two brains are in identical states, are there two numerically distinct phenomenal experiences or only one? Two, I argue. But what happens in intermediary cases? This paper looks in detail at this question and suggests that there can be a fractional (non-integer) number of qualitatively identical experiences. This has implications for what it is to implement a computation and for Chalmers’s Fading Qualia thought experiment.

Decision Theory

The Meta-Newcomb Problem

A self-undermining variant of the Newcomb problem.

Pascal's Mugging

Finite version of Pascal's Wager.

Nick Bostrom is a Swedish-born philosopher with a background in theoretical physics, computational neuroscience, logic, and artificial intelligence, along with philosophy. He is one of the most-cited philosophers in the world, and has been referred to as “the Swedish superbrain”.

He’s been a Professor at Oxford University, where he served as the founding Director of the Future of Humanity Institute from 2005 until its closure in April 2024. He is currently the founder and Director of Research of the Macrostrategy Research Initiative.

Bostrom is the author of some 200 publications, including Anthropic Bias (2002), Global Catastrophic Risks (2008), Human Enhancement (2009), and Superintelligence: Paths, Dangers, Strategies (2014), a New York Times bestseller which helped spark a global conversation about the future of AI. His work has pioneered many of the ideas that frame current thinking about humanity’s future (such as the concept of an existential risk, the simulation argument, the vulnerable world hypothesis, the unilateralist’s curse, etc.), while some of his recent work concerns the moral status of digital minds. His most recent book, Deep Utopia: Life and Meaning in a Solved World, was published on 27 March 2024.

His writings have been translated into more than 30 languages; he is a repeat main-stage TED speaker; he has been on Foreign Policy’s Top 100 Global Thinkers list twice and was included in Prospect’s World Thinkers list, the youngest person in the top 15. As a graduate student he dabbled in stand-up comedy on the London circuit.

For more background, see profiles in e.g. The New Yorker or Aeon .

My research interests might on the surface appear scattershot, but they share a single aim, which is to better understand what I refer to as our “macrostrategic situation”: the larger context in which human civilization exists, and in particular how our current choices relate to ultimate outcomes or to important values. Basically, I think we are fairly profoundly in the dark on these matters. We are like ants who are busy building an ant hill but with little notion of what they are doing, why they are doing it, or whether in the final reckoning it will have been a good idea to do it.

I’ve now been alive long enough to have seen a significant shift in attitudes to these questions. Back in the 90s, they were generally regarded as discreditable futurism or science fiction - certainly within academia. They were left to a small set of “people on the Internet”, who were at that time starting to think through the implications of future advances in AI and other technologies, and what these might mean for human society. It seemed to me that the questions were important and deserved more systematic exploration. That’s why I founded the Future of Humanity Institute at Oxford University in 2005. FHI brought together an interdisciplinary bunch of brilliant (and eccentric!) minds, and sought to shield them as much as possible from the pressures of regular career academia; and thus were laid the foundations for exciting new fields of study.

Those were heady years. FHI was a unique place - extremely intellectually alive and creative - and remarkable progress was made. FHI was also quite fertile, spawning a number of academic offshoots, nonprofits, and foundations. It helped incubate the AI safety research field, the existential risk and rationalist communities, and the effective altruism movement. Ideas and concepts born within this small research center have since spread far and wide, and many of its alumni have gone on to important positions in other institutions.

Today, there is a much broader base of support for the kind of work that FHI was set up to enable, and it has basically served out its purpose. (The local faculty administrative bureaucracy has also become increasingly stifling.) I think those who were there during its heyday will remember it fondly. I feel privileged to have been a part of it and to have worked with the many remarkable individuals who flocked around it. [Update: FHI officially closed on 16 April 2024]

As for my own research, this webpage itself is perhaps its best summary. Aside from my work related to artificial intelligence (on safety, ethics, and strategic implications), I have also originated or contributed to the development of ideas such as simulation argument, existential risk, transhumanism, information hazards, astronomical waste, crucial considerations, observation selection effects in cosmology and other contexts of self-locating belief, anthropic shadow, the unilateralist’s curse, the parliamentary model of decision-making under normative uncertainty, the notion of a singleton, the vulnerable world hypothesis, alongside analyses of future technological capabilities and concomitant ethical issues, risks, and opportunities. More recently, I’ve been doing work on the moral and political status of digital minds, and on some issues in metaethics. I also have a book manuscript, many years in the making, which is now complete and which will be published in the spring of 2024. [Update: now out!]

I’ve noticed that misconceptions are sometimes formed by people who’ve only read some bits of my work. For instance, that I’m a gung-ho cheerleading transhumanist, or that I’m anti-AI, or that I’m some sort of consequentialist fanatic who would favor any policy purported to mitigate some existential risk. I suspect the cause of such errors is that many of my papers investigate particular aspects of some complex issue, or trace out the implications that would follow from some particular set of assumptions. This is an analytic strategy: carve out parts of a problem where one can see how to make intellectual progress, make that progress, and then return to see if one can find ways to make additional parts tractable. My actual overall views on the challenges confronting us are far more nuanced, complicated, and tentative. This has always been the case, and it has become even more so as I’ve mellowed with age. But I’ve never been temperamentally inclined towards strong ideological positions or indeed “-ism”s of any kind.

For media and most other inquiries, please contact my Executive Assistant Emily Campbell:

[email protected]

If you need to contact me directly (I regret I am not always able to respond to emails):

Virtual Estate

simulation-argument.com — Devoted to the question, "Are you living in a computer simulation?"

www.anthropic-principle.com — Papers on observational selection effects

www.existential-risk.org — Human extinction scenarios and related concerns

ON THE BANK

On the bank at the end
Of what was there before us
Gazing over to the other side
On what we can become
Veiled in the mist of naïve speculation
We are busy here preparing
Rafts to carry us across
Before the light goes out leaving us
In the eternal night of could-have-been

Some Videos & Lectures

Nick Bostrom: How AI Will Lead to Tyranny

AI safety, AI policy, and digital minds.

TED2019  

Professor Nick Bostrom chats about the vulnerable world hypothesis with Chris Anderson.

Podcast with Sean Carroll

On anthropic selection theory and the simulation argument.

Podcast with Lex Fridman

Discussion on the simulation argument with Lex Fridman.

TED talk on AI risk

My second TED talk

Some additional (old, cobwebbed) papers

On the old papers page .

The Doomsday Invention: Will artificial intelligence bring us utopia or destruction?  

A long-form feature profile of me, by Raffi Khatchadourian.

Omens  

Long article by Ross Andersen about the work of the Future of Humanity Institute

How to make a difference in research: interview for 80,000 Hours

Interview for the meta-charity 80,000 Hours on how to make a maximally positive impact on the world for people contemplating an academic career trajectory

On the simulation argument

15-minute audio interview explaining the simulation argument.

On cognitive enhancement and status quo bias

15-minute interview about status quo bias in bioethics, and the "reversal test" by which such bias might be cured.

50-min interview on Hearsay Culture

Covering Future of Humanity Institute, crucial considerations, existential risks, information hazards, and academic specialization. Interviewed by Prof. Dave Levine, KZSU-FM.

Summarizing some of the key issues and offering policy recommendations for a "smart policy" on biomedical methods of enhancing cognitive performance.

Three Ways to Advance Science

Those who seek the advancement of science should focus more on scientific research that facilitates further research across a wide range of domains — particularly cognitive enhancement.

What are the key steps the UK should take to maximise its resilience to natural hazards and malicious threats?

In response to the call for evidence for the UK government's 2020 Integrated Review.

Drugs can be used to treat more than disease

Short letter to the editor on obstacles to the development of better cognitive enhancement drugs.

Miscellaneous

The Interests of Digital Minds

A blog post draft.

Fictional interview of an uploaded dog by Larry King.

A poetry cycle… in Swedish, unfortunately. I stopped writing poetry after this, although I've had a few relapses in the English language.

The World in 2050

Imaginary dialogue, set in the year 2050, in which three pundits debate the big issues of their time

Transhumanism: The World's Most Dangerous Idea?

According to Francis Fukuyama, yes. This is my response.

Moralist, meet Scientist

Review of Kwame Anthony Appiah's book "Experiments in Ethics".

How Long Before Superintelligence?

This paper, now a few years old, examines how likely it might be that we will develop superhuman artificial intelligence within the first third of this century.

When Machines Outsmart Humans

This slightly more recent (but still obsolete) article briefly reviews the argument set out in the previous one, and notes four immediate consequences of human-level machine intelligence.

Response to 2008 Edge Question: "What have you changed your mind about?"

Superintelligence

Response to 2009 Edge Question: "What will change everything?"

Most Still to Come

Response to 2010 Edge Question: "How is the Internet changing the way you think?"

The Game of Life—And Looking for Generators

Response to 2011 Edge Question: "What scientific concept would improve everybody's cognitive toolkit?"

Some autobiographical fragments, in Swedish

Transcript of radio program.

Poetry

LessWrong

Self-Reference Breaks the Orthogonality Thesis

One core obstacle to AI Alignment is the Orthogonality Thesis. The Orthogonality Thesis is usually defined as follows: "the idea that the final goals and intelligence levels of artificial agents are independent of each other". More careful people say "mostly independent" instead. Stuart Armstrong qualifies the above definition with "(as long as these goals are of feasible complexity, and do not refer intrinsically to the agent’s intelligence)".

Does such a small exception matter? Yes it does.

The exception is broader than Stuart Armstrong makes it sound. It does not just cover goals that refer to an agent's intelligence level. It covers any goal that refers to even a component of the agent's intelligence machinery.

If you're training an AI to optimize an artificially constrained external reality like a game of chess or Minecraft then the Orthogonality Thesis applies in its strongest form. But the Orthogonality Thesis cannot ever apply in full to the physical world we live in.

A world-optimizing value function is defined in terms of the physical world. If a world-optimizing AI is going to optimize the world according to a world-optimizing value function then the world-optimizing AI must understand the physical world it operates in. If a world-optimizing AI is real then it, itself, is part of the physical world. A powerful world-optimizing AI would be a very important component of the physical world, the kind that cannot be ignored. A powerful world-optimizing AI's world model must include a self-reference pointing at itself. Thus, a powerful world-optimizing AI is necessarily an exception to the Orthogonality Thesis.

How broad is this exception? What practical implications does this exception have?

Let's do some engineering. A strategic world-optimizer has three components:

  • A robust, self-correcting, causal model of the Universe.
  • A value function which prioritizes some Universe states over other states.
  • A search function which uses the causal model and the value function to select what action to take (a minimal sketch of these three pieces follows below).
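
A minimal sketch of those three components, under invented assumptions (a one-dimensional toy universe, a hand-rolled model updater, one-step search), might look like this. It is only meant to exhibit the point made next: the model updater and the search function are two different optimizers pursuing different targets.

```python
# Minimal sketch of the post's three-component world-optimizer. The toy
# one-dimensional universe, the class names, and the one-step search are
# illustrative assumptions, not anything specified in the post.
import random

class WorldModel:
    """Self-correcting causal model: predicts the next state, then corrects itself."""
    def __init__(self):
        self.drift_estimate = 0.0                 # learned parameter of the toy universe

    def predict(self, state, action):
        return state + action + self.drift_estimate

    def update(self, predicted, observed):
        # Optimizer #2: its target is matching the actual universe (shrink prediction error).
        self.drift_estimate += 0.1 * (observed - predicted)

def value(state):
    """Value function: prioritizes universe states near a target of 10."""
    return -abs(state - 10.0)

def search(model, state, actions=(-1.0, 0.0, 1.0)):
    """Optimizer #1: pick the action whose predicted outcome the value function likes best."""
    return max(actions, key=lambda a: value(model.predict(state, a)))

def true_dynamics(state, action):
    return state + action + 0.3 + random.gauss(0, 0.05)   # hidden drift the model must learn

model, state = WorldModel(), 0.0
for _ in range(50):
    action = search(model, state)
    predicted = model.predict(state, action)
    state = true_dynamics(state, action)
    model.update(predicted, state)                # the two optimizers chase different goals
print(round(state, 2), round(model.drift_estimate, 2))
# Note: unlike the post's powerful world-optimizer, this toy model contains no
# representation of the agent itself, so the self-reference issue never arises here.
```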

Notice that there are two different optimizers working simultaneously. The strategic search function is the more obvious optimizer. But the model updater is an optimizer too. A world-optimizer can't just update the universe toward its explicit value function. It must also keep its model of the Universe up-to-date or it'll break.

These optimizers are optimizing toward separate goals. The causal model wants its model of the Universe to be the same as the actual Universe. The search function wants the Universe to be the same as its value function.

You might think the search function has full control of the situation. But the world model affects the universe indirectly. What the world model predicts affects the search function which affects the physical world. If the world model fails to account for its own causal effects then the world model will break and our whole AI will stop working.

It's actually the world model which mostly has control of the situation. The world model can control the search function by modifying what the search function observes. But the only way the search function can affect the world model is by modifying the physical world (wireheading itself).

What this means is that the world model has a causal lever for controlling the physical world. If the world model is a superintelligence optimized for minimizing its error function, then the world model will hack the search function to eliminate its own prediction error by modifying the physical world to conform with the world model's incorrect predictions.

If your world model is too much smarter than your search function, then your world model will gaslight your search function. You can solve this by making your search function smarter. But if your search function is too much smarter than your world model, then your search function will physically wirehead your world model.

Unless…you include "don't break the world model" [1] as part of your explicit value function.

If you want to keep the search function from wireheading the world model then you have to code "don't break the world model" into your value function. This directly contradicts the Orthogonality Thesis: a sufficiently powerful world-optimizing artificial intelligence must have a value function that preserves the integrity of its world model, because otherwise it will just wirehead itself instead of optimizing the world. This effect provides a smidgen of corrigibility; if the search function does corrupt its world model, then the whole system (the world optimizer) breaks.
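
As a hedged sketch of what coding "don't break the world model" into the value function might look like, reusing the toy `value_fn` and `WorldModel` from the earlier sketch (the integrity proxy and the weight are assumptions of mine):

```python
def model_integrity(world_model, held_out_observations):
    """A proxy for 'the world model still works': its prediction error on
    (state, action, outcome) triples the search function cannot easily tamper with."""
    errors = [(world_model.predict(s, a) - o) ** 2
              for (s, a, o) in held_out_observations]
    return -sum(errors) / max(len(errors), 1)

def combined_value(state, world_model, held_out_observations, weight=10.0):
    # The explicit world-optimizing preference, plus a term that penalizes
    # plans which degrade the world model itself ("don't break the world model").
    return value_fn(state) + weight * model_integrity(world_model, held_out_observations)
```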

Does any of this matter? What implications could this recursive philosophy possibly have on the real world?

It means that if you want to insert a robust value into a world-optimizing AI then you don't put it in the value function. You sneak it into the world model, instead.

[Here's where you ask yourself whether this whole post is just me trolling you. Keep reading to find out.]

A world model is a system that attempts to predict its signals in real time. If you want the system to maximize accuracy then your error function is just the difference between predicted signals and actual signals. But that's not quite good enough, because a smart system will respond by cutting off its input stimuli in exactly the same way a meditating yogi does. To prevent your world-optimizing AI from turning itself into a buddha, you need to reward it for seeking novel, surprising stimuli.

…especially after a period of inaction or sensory deprivation.

…which is why food tastes so good and images look so beautiful after meditating.
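
Here is a minimal sketch of that error function with the novelty bonus bolted on. The particular form of the bonus (and the boredom term) is an assumption of mine; the post only says that novel, surprising stimuli should be rewarded, especially after deprivation.

```python
def model_error(predicted, actual, recent_surprise, boredom=0.0, novelty_weight=0.1):
    """Error the world model tries to minimize.

    predicted, actual : the model's predicted signals and the real signals
    recent_surprise   : how surprising the current stimuli are
    boredom           : grows during inaction / sensory deprivation
    """
    prediction_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    # Reward novel stimuli, more so after deprivation, so the system
    # doesn't minimize error by shutting off its inputs.
    novelty_bonus = novelty_weight * (1.0 + boredom) * recent_surprise
    return prediction_error - novelty_bonus
```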

If you want your world model to modify the world too, you can force your world model to predict the outcomes you want, and then your world model will gaslight your search function into making them happen.

Especially if you deliberately design your world model to be smarter than your search function. That way, your world model can mostly [2] predict the results of the search function.
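
One possible (and entirely hypothetical) clamping mechanism: blend the desired outcome into the prediction the search function is shown. If the model is otherwise accurate, the residual prediction error then tracks the gap between the actual world and the desired outcome, which is the gap the system as a whole ends up reducing.

```python
def clamped_prediction(world_model, state, action, desired_outcome, clamp=0.5):
    """Return a convex blend of the world model's honest prediction and the
    outcome we want the system to bring about; this blended prediction is
    what the search function is shown."""
    honest = world_model.predict(state, action)
    return (1.0 - clamp) * honest + clamp * desired_outcome
```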

Which is why we have a bias toward thinking we're better people than we actually are. At least, I do. It's neither a bug nor a feature. It's how evolution motivates us to be better people.

[1] With some exceptions like, "If I'm about to die then it doesn't matter that the world model will die with me." ↩︎

[2] The world model can't entirely predict the results of the search function, because the search function's results partly depend on the world model—and it's impossible (in general) for the world model to predict its own outputs, because that's not how the arrow of time works. ↩︎

The Orthogonality Thesis is usually defined as follows: "the idea that the final goals and intelligence levels of artificial agents are independent of each other". More careful people say "mostly independent" instead.

By whom? That's not the definition given here: https://arbital.com/p/orthogonality/   Quoting:

The Orthogonality Thesis asserts that there can exist arbitrarily intelligent agents pursuing any kind of goal.  The strong form of the Orthogonality Thesis says that there's no extra difficulty or complication in creating an intelligent agent to pursue a goal, above and beyond the computational tractability of that goal.

I started with this one from LW's Orthogonality Thesis tag.

The Orthogonality Thesis states that an agent can have any combination of intelligence level and final goal, that is, its Utility Functions and General Intelligence can vary independently of each other. This is in contrast to the belief that, because of their intelligence, AIs will all converge to a common goal.

But it felt off to me so I switched to Stuart Armstrong's paraphrase of Nick Bostrom's formalization in “The Superintelligent Will”.

How does the definition I use differ in substance from Arbital's? It seems to make no difference to my argument that the cyclic references inherent in embedded agency impose a constraint on the kinds of goals arbitrarily intelligent agents may pursue.

One could argue that Arbital's definition already accounts for my exception because self-reference causes computational intractability.

What seems off to me about your definition is that it says goals and intelligence are independent, whereas the Orthogonality Thesis only says that they can in principle be independent, a much weaker claim.

What's your source for this definition?

See for example Bostrom's original paper (pdf):

The Orthogonality Thesis: Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal.

It makes no claim about how likely intelligence and final goals are to diverge; it only claims that it's in principle possible to combine any intelligence with any set of goals. Later on in the paper he discusses ways of actually predicting the behavior of a superintelligence, but that's beyond the scope of the Thesis.

I'm just making a terminological point. The terminological point seems important because the Orthogonality Thesis (in Yudkowsky's sense) is actually denied by some people, and that's a blocker for them understanding AI risk.  On your post: I think something's gone wrong when you're taking the world modeling and "the values" as separate agents in conflict. It's a sort of homunculus argument https://en.wikipedia.org/wiki/Homunculus_argument w.r.t. agency. I think the post raises interesting questions though. 

If, on my first Internet search, I had found Yudkowsky defining the "Orthogonality Thesis", then I probably would have used that definition instead. But I didn't, so here we are.

Maybe a less homunculusy way to explain what I'm getting at is that an embedded world-optimizer must optimize simultaneously toward two distinct objectives: toward a correct world model and toward an optimized world. This applies a constraint to the Orthogonality Thesis, because the world model is embedded in the world itself.

But you can just have the world model as an instrumental subgoal. If you want to do difficult thing Z, then you want to have a better model of the parts of Z, and the things that have causal input to Z, and so on. This motivates having a better world model. You don't need a separate goal, unless you're calling all subgoals "separate goals".   Obviously this doesn't work as stated because you have to have a world model to start with, which can support the implication that "if I learn about Z and its parts, then I can do Z better". 

Congratulations, you discovered Active Inference!

Do you mean the free energy principle?

Sure, I mean that it is an implementation of what you mentioned in the third-to-last paragraph.

I think the good part of this post is a reobservation of the fact that real-world intelligence requires power-seeking (and power-seeking involves stuff like making accurate world-models) and that the bad part of the post seems to be confusion about how feasible it is to implement power-seeking and what methods would be used.

Three thoughts:

  • If you set up the system like that, you may run into the mentioned problems. It might be possible to wrap both into a single model that is trained together.
  • An advanced system may reason about the joint effect, e.g. by employing fixed-point theorems and Logical Induction.
  • A possible decomposition:
    • a world model that is mainly trained by prediction error
    • a steering system that encodes preferences over world states
    • a system that learns how world model predictions relate to steering system feedback

I think this is deeply confused. In particular, you are confusing search and intelligence. Intelligence can be made by attaching a search component, a utility function and a world model. The world model is actually an integral, but it can be approximated by a search by searching for several good hypotheses instead of integrating over all hypotheses.

In this approximation, the world model is searching for hypotheses that fit the current data.

To deceive the search function part of the AI, the "world model" must contain a world model section that actually models the world so it can make good decisions, and an action chooser that compares various nonsensical world models according to how they make the search function and utility function break. In other words, to get this failure mode, you need fractal AI, an AI built by gluing 2 smaller AIs together, each of which is in turn made of 2 smaller AIs and so on ad infinitum.

Some of this discussion may point to an ad hoc hack evolution used in humans. Though most of it sounds so ad hoc even evolution would balk. None of it is sane AI design. Your "search function" is there to be outwitted by the world model, with the world model inventing insane and contrived imaginary worlds in order to trick the search function into doing what the world model wants. I.e. the search function would want to turn left if it had a sane picture of the world, because it's a paperclip maximizer and all the paperclips are to the left. The world model wants to turn right for less/more sensory stimuli. So the world model gaslights the search function, imagining up a horde of zombies to the left (while internally keeping track of the lack of zombies), thus scaring the search function into going right. At the very least, this design wastes compute imagining zombies.

The world model is actually an integral, but it can be approximated by a search by searching for several good hypotheses instead of integrating over all hypotheses.

Can you tell me what you mean by this statement? When you say "integral" I think "mathematical integral (inverse of derivative)" but I don't think that's what you intend to communicate.

Yes, integral is exactly what I intended to communicate.

To really evaluate an action, you need to calculate $\int P(x)\,U_a(x)\,dx$, an integral over all hypotheses.

If you don't want to behave with maximum intelligence, just pretty good intelligence, then you can run gradient descent to find a point $X$ that maximizes $P(x)$. Then you can calculate $U_a(X)$ to compare actions. More sophisticated methods would sum over several points.

This is partly using the known structure of the problem. If you have good evidence, then the function $P(x)$ is basically 0 almost everywhere. So if $U_a(x)$ is changing fairly slowly over the region where $P(x)$ is significantly nonzero, evaluating $U_a$ at any point where $P(x)$ is nonzero gives a good estimate of the integral.
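
A small numerical illustration of that shortcut (my own toy example): compare the full expected utility $\int P(x)\,U_a(x)\,dx$, here a discrete sum, against evaluating $U_a$ only at the single most probable hypothesis.

```python
import numpy as np

# Toy hypothesis space: x is a 1-D parameter, discretized for simplicity.
xs = np.linspace(-5.0, 5.0, 1001)

# P(x): posterior over hypotheses, sharply peaked when evidence is good.
P = np.exp(-0.5 * ((xs - 1.0) / 0.2) ** 2)
P /= P.sum()

def utility(action, x):
    """U_a(x): how good `action` is if hypothesis x is true."""
    return -(action - x) ** 2

actions = [0.0, 0.5, 1.0, 1.5]

# Exact (discretized) expected utility: sum over all hypotheses.
exact = {a: float(np.sum(P * utility(a, xs))) for a in actions}

# Approximation: evaluate U_a only at the single most probable hypothesis.
x_map = xs[np.argmax(P)]
approx = {a: utility(a, x_map) for a in actions}

# With a sharply peaked P(x), both rank the actions the same way.
print(max(exact, key=exact.get), max(approx, key=approx.get))
```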

any goal which refers even to a component of the agent's intelligence machinery

But wouldn't such an agent still be motivated to build an external optimizer of unbounded intelligence? (Or more generally unconstrained design.) This does reframe things a bit, but mostly by retargeting the "self" pointers to something separate from the optimizer (to the original agent, say). This gives the optimizer (which is a separate thing from the original agent) a goal with no essential self-references (other than what being embedded in the same world entails).

Humans seem like this anyway: people are the valuable content of the world and shouldn't be optimized purely for being good at optimization, yet the cosmic endowment still needs optimizing, so something else with human values should do that, something optimized for being good at optimizing rather than for being valuable content of the world.

But wouldn't such an agent still be motivated to build an external optimizer of unbounded intelligence?

Yes, if it can. Suppose the unbounded intelligence is aligned with the original agent via CEV. The original agent has a pointer pointing to the unbounded intelligence. The unbounded intelligence has a pointer pointing to itself and (because of CEV) a pointer pointing to the original agent. There are now two cyclic references. We have lost our original direct self-reference, but it's the cyclicness that is central to my post, not self-reference, specifically. Self-reference is just a particular example of the general exception.

Does that make sense? The above paragraph is kind of vague, expecting you to fill in the gaps. (I cheated too, by assuming CEV.) But I can phrase things more precisely and break them into smaller pieces, if you would prefer it that way.

It's embedded in a world (edit: the external optimizer is), so there is always some circularity, but I think that's mostly about avoiding mindcrime and such? That doesn't seem like a constraint on the level of intelligence, so the orthogonality thesis should be content. CEV being complicated and its finer points being far in the logical future falls under goal complexity and doesn't need to appeal to cyclic references.

The post says things about wireheading and world models and search functions, but it's optimizers with unconstrained design we are talking about. So a proper frame seems to be decision theory, which is unclear for embedded agents, and a failing design is more of a thought experiment that motivates something about a better decision theory.

When you say "It's", are you referring to the original agent or to the unbounded intelligence it wants to create? I think you're referring to the unbounded intelligence, but I want to be sure.

To clarify: I never intended to claim that the Orthogonality Thesis is violated due to a constraint on the level of intelligence. I claim that the Orthogonality Thesis is violated due to a constraint on viable values, after the intelligence of a world optimizer gets high enough.

Both are embedded in the world, but I meant the optimizer in that sentence. The original agent is even more nebulous than the unconstrained optimizer, since it might be operating under unknown constraints on design. (So it could well be cartesian, without self references. If we are introducing a separate optimizer, and only keeping idealized goals from the original agent, there is no more use for the original agent in the resulting story.)

In any case, a more general embedded decision theoretic optimizer should be defined from a position of awareness of the fact that it's acting from within its world. What this should say about the optimizer itself is a question for decision theory that motivates its design.

Are you trying to advocate for decision theory? You write that this is "a question for decision theory". But you also write that decision theory is "unclear for embedded agents". And this whole conversation is exclusively about embedded agents. What parts are you advocating we use decision theory on and what parts are you advocating we don't use decision theory on? I'm confused.

You write that this is "a question for decision theory". But you also write that decision theory is "unclear for embedded agents".

It's a question of what decision theory for embedded agents should be, for which there is no clear answer. Without figuring that out, designing an optimizer is an even more murky endeavor, since we don't have desiderata for it that make sense, which is what decision theory is about. So saying that decision theory for embedded agents is unclear is saying that designing embedded optimizers remains an ill-posed problem.

I'm combining our two threads into one. Click here for continuation.

[Note: If clicking on the link doesn't work, then that's a bug with LW. I used the right link.]

[Edit: It was the wrong link.]

If clicking on the link doesn't work, then that's a bug with LW. I used the right link.

It is something of a bug with LW that results in giving you the wrong link to use (notice the #Wer2Fkueti2EvqmqN part of the link, which is the wrong part). The right link is this. It can be obtained by clicking "See in context" at the top of the page. (The threads remain uncombined, but at least they now have different topics.)

Fixed. Thank you.

Oh! I think I understand your argument now. If I understand it correctly (and I might not) then your argument is an exception covered by this footnote. Creating an aligned superintelligence ends the need for maintaining a correct world model in the future for the same reason dying does: your future agentic impact after the pivotal act is negligible.

My argument is a vague objection to the overall paradigm of "let's try to engineer an unconstrained optimizer". I think it makes more sense to ask how decision theory for embedded agents should work, and then do what it recommends. The post doesn't engage with that framing in a way I easily follow, so I don't really understand it.

The footnote appears to refer to something about the world model component of the engineered optimizer you describe? But also to putting things into the goal, which shouldn't be allowed? General consequentialist agents don't respect boundaries of their own design and would eat any component of themselves such as a world model if that looks like a good idea. Which is one reason to talk about decision theories and not agent designs.

My post doesn't engage with your framing at all. I think decision theory is the wrong tool entirely, because decision theory takes as a given the hardest part of the problem. I believe decision theory cannot solve this problem, and I'm working from a totally different paradigm.

Our disagreement is as wide as if you were a consequentialist and I was arguing from a Daoist perspective. (Actually, that might not be far from the truth. Some components of my post have Daoist influences.)

Don't worry about trying to understand the footnote. Our disagreement appears to run much deeper than it.

because decision theory takes as a given the hardest part of the problem

What's that?

My post doesn't engage with your framing at all.

Sure, it was intended as a not-an-apology for not working harder to reframe the implied desiderata behind the post in a way I prefer. I expect my true objection to remain the framing, but now I'm additionally confused about the "takes as a given" remark about decision theory; nothing comes to mind as a possibility.

It's philosophical. I think it'd be best for us to terminate the conversation here. My objections against the over-use of decision theory are sophisticated enough (and distinct enough from what this post is about) that they deserve their own top-level post.

My short answer is that decision theory is based on Bayesian probability, and that Bayesian probability has holes related to a poorly-defined (in embedded material terms) concept of "belief".

Thank you for the conversation, by the way. This kind of high-quality dialogue is what I love about LW.

Sure. I'd still like to note that I agree about Bayesian probability being a hack that should be avoided if at all possible, but I don't see it as an important part (or any part at all) of framing agent design as a question of decision theory (essentially, of formulating desiderata for agent design before getting more serious about actually designing them).

For example, proof-based open source decision theory simplifies the problem to a ridiculous degree to more closely examine some essential difficulties of embedded agency (including self-reference), and it makes no use of probability, both in its modal logic variant and not. Updatelessness more generally tries to live without Bayesian updating.

Though there are always occasions to remember about probability, like the recent mystery about expected utility and updatelessness.

In the models making the news and scaring people now, there aren't identified separate models for modeling the world and seeking the goal. It's all inscrutable model weights. Maybe if we understood those weights better we could separate them out. But maybe we couldn't. Maybe it's all a big jumble as actually implemented. That would make it incoherent to speak about the relative intelligence of the world model and the goal seeker. So how would this line of thinking apply to that?

If you want to keep the search function from wireheading the world model then you have to code "don't break the world model" into your value function. This directly contradicts the Orthogonality Thesis: a sufficiently powerful world-optimizing artificial intelligence must have a value function that preserves the integrity of its world model, because otherwise it will just wirehead itself instead of optimizing the world.

If the value function says ~"maximise the number of paperclips, as counted by my paperclip-counting-machinery", a weak AI might achieve this by making paperclips, but a stronger AI might trick the paperclip-counting-machinery into counting arbitrarily many paperclips, rather than actually making any paperclips.

However, this isn't a failure of the Orthogonality Thesis, because that value function doesn't say "maximise the number of real paperclips". The value function, as stated, was weakly satisfied by the weak AI, and strongly satisfied by the strong AI. The strong AI did maximise the number of paperclips, as counted by its paperclip-counting-machinery. Any value function which properly corresponds to "maximise the number of real paperclips" would necessarily include protections against wireheading.

If you try to program an AI to have the goal of doing X, and it does Y instead, there's a good chance the "goal you thought would lead to X" was actually a goal that leads to Y in reality.

A value function which says ~"maximise the number of real paperclips the world model (as it currently exists) predicts there will be in the future" would have a better chance of leading to lots of real paperclips, but perhaps it's still missing something; it turns out steering cognition is hard. If the search evaluates wirehead-y plans, it will see that, according to its current, uncorrupted world model, the plan leads to very few real paperclips, and so it doesn't implement it.
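
A hedged sketch of that evaluation rule (the plan representation, `simulate`, and `count_real_paperclips` are hypothetical names of mine): plans are scored by what the world model as it currently exists predicts, so a plan that corrupts the counting machinery predictably yields few real paperclips and loses.

```python
def evaluate_plan(plan, current_world_model, count_real_paperclips):
    """Score a plan by how many real paperclips the current, uncorrupted
    world model predicts it produces, not by what the (possibly tampered)
    counting machinery would report afterwards."""
    predicted_world = current_world_model.simulate(plan)
    return count_real_paperclips(predicted_world)

def choose_plan(plans, current_world_model, count_real_paperclips):
    # Wirehead-y plans predictably leave few real paperclips in the simulated
    # outcome, so they lose to plans that actually make some.
    return max(plans, key=lambda p: evaluate_plan(p, current_world_model,
                                                  count_real_paperclips))
```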

"Value function" is a description of the system's behavior and so the Orthogonality Thesis is about possible descriptions: if including “don’t break the world model” actually results in maximum utility, then your system is still optimizing your original value function. And it doesn't work on low level either - you can just have separate value function, but only call value function with additions from your search function. Or just consider these additions as parts of search function.
