How to Make Oracle AIs Safe: The Counterfactual Oracle AI | by Hein de Haan | May 2022


The Counterfactual Oracle AI


In the (relatively) near future, it’s likely that some organization will build a powerful artificial intelligence (AI) that surpasses human-level intelligence. If we see “intelligence” as the ability to achieve one’s goals, such a superintelligence would by definition be unusually good at achieving its goals. This presents a problem for us humans: what if the goals of this superintelligence conflict with our own? Note that the fate of non-human animals depends, in large part, on humans — because humans are more intelligent than they are. Likewise, our future fate might depend on an artificial superintelligence. What if its goals are best achieved by acquiring as many resources as possible, eliminating humans as a side effect?

This may sound crazy, but many serious thinkers have warned about existential risk from AI. The problem is that resource acquisition — e.g. to build supercomputers — is instrumental in achieving almost any goal. Say we want a superintelligence to find the theory of everything. It would probably be better at finding that theory if it were even smarter — and having a larger “brain” is a great way of becoming smarter. So it may attempt to convert as much material on Earth as possible into computronium (computing material) and kill us all — not because it hates us, but as a side effect.

Of course, if a superintelligence actually shares our goals — our moral values — it wouldn’t kill us. In fact, it would be great: it might finally cure cancer, eliminate other diseases, and generally bring us awesomeness. This is called aligning an AI with our values — we don’t yet know how to do this, and it’s an active area of research.

It has been suggested that in order to keep an artificial superintelligence safe, we just need to build one that doesn’t actually “do” anything; instead, we should make one that just answers questions. This is the concept of the oracle AI — but it misses something. If the oracle’s goal is answering questions, the same problems as before arise: resource acquisition is still beneficial, for example, and although the oracle may have less of an influence on the world than a “free” AI, it can still manipulate humans through the answers it gives. Since it is, after all, superintelligent, the oracle might be able to get humans to give it access to e.g. more and more computing power, which it could use to answer questions better.

The problem here is that humans aren’t safe, and can in principle be convinced (or manipulated) into doing stuff the oracle wants, because the oracle is more intelligent than they are. It’s quite worrying that a seemingly innocent goal like “answer questions” is already dangerous.

So oracle AIs are not necessarily safe — but could we make a variant that is?

Counterfactual oracle AI

As it turns out, yes! The thing is, a “standard” oracle AI might get a reward (a “point”, so to speak) every time it answers a question to the satisfaction of the human(s) asking the question. It might seem this would cause the oracle to pursue the goal of answering questions in a satisfactory manner, but it might do something more sinister: it could manipulate the humans into giving it the reward, or indeed attempt to acquire computing resources (and kill us as a side effect), as discussed before.
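To make the incentive problem concrete, here is a minimal sketch of that naive setup. All names here (the functions, the satisfaction score, the example question) are hypothetical stand-ins I'm introducing for illustration; they are not from the article.

```python
# A hypothetical sketch of the "standard" oracle setup: the oracle's
# reward is simply the satisfaction the human reports after reading
# the answer.

def naive_oracle_episode(oracle_answer, human_read, human_satisfaction,
                         question):
    """One question-answer round with a 'standard' oracle AI."""
    answer = oracle_answer(question)
    human_read(answer)                   # the human always reads the answer
    reward = human_satisfaction(answer)  # ...and the reward depends on the
                                         # human's reaction to it
    return answer, reward

# Dummy stand-ins, just so the sketch runs:
answer, reward = naive_oracle_episode(
    oracle_answer=lambda q: f"Answer to: {q}",
    human_read=lambda a: print("Human reads:", a),
    human_satisfaction=lambda a: 1.0,    # the human liked the answer
    question="What is the theory of everything?",
)
print("Oracle's reward:", reward)
```

The problem is visible in the reward path: it flows through a human who has already read the answer, so a sufficiently capable oracle can raise its reward by writing answers that manipulate that human, not just accurate ones.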

So this specific reward doesn’t work. But what exactly is the problem? I mean, we want the oracle to give answers to our questions, right? Shouldn’t we reward it when it does?

Well, sort of. The problem is that we actually read the answers the oracle gives, and (potentially) act on the information we get. Not reading the answers would result in a safe situation, but nobody builds an oracle AI to then not read the answers it gives.

Shut the oracle down before we read its answer to a question we give it? Sure, that would be safe. The oracle wouldn’t be able to manipulate us. But such a design would also be quite useless: the oracle would never get any rewards, so it wouldn’t be interested in giving satisfactory answers. It might as well output gibberish in that case.

What we need to do is sometimes not read the answer. Hear me out: we create a situation where, if we ask the oracle a question, there’s a small (but non-zero) probability that the answer the oracle gives is hidden from us. Instead, the answer is sent to an automated system that checks its accuracy, and the oracle is given a reward (or punishment) corresponding to how accurate its answer is. In all other cases, the reward is zero and the oracle is shut down before we read the answer.

This way, we almost always see the answer to the question we give the oracle, and the oracle is motivated to give us useful answers without manipulating us. Why? Because it gains nothing by manipulating us! When we do see the answer (the “normal” case), the oracle is already shut down, so whatever we do with its answer doesn’t matter to the oracle anymore. In the rare case where the oracle does get a non-zero reward, we don’t see its answer, so it literally can’t manipulate us. And giving us nonsense as an answer wouldn’t be smart either: the oracle knows there’s a small probability the answer goes to the checking system, in which case its reward is higher when its answer is more accurate.
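Here is a minimal sketch of that counterfactual protocol, assuming a fixed erasure probability and an automated checker that returns an accuracy score. The probability EPSILON, the function names, and the scoring are assumptions for illustration, not details from the article or the underlying paper.

```python
import random

EPSILON = 0.01  # small but non-zero chance the answer is hidden from us

def counterfactual_episode(oracle_answer, check_accuracy, human_read,
                           question):
    """One question-answer round with a counterfactual oracle."""
    answer = oracle_answer(question)

    if random.random() < EPSILON:
        # Erasure event: the human never sees the answer. An automated
        # checker scores it, and that score is the oracle's only reward.
        reward = check_accuracy(question, answer)
        human_sees_answer = False
    else:
        # Normal case: the oracle gets zero reward and is shut down
        # *before* the human reads the answer, so nothing the human does
        # with the answer can feed back into the oracle's reward.
        reward = 0.0
        human_sees_answer = True
        human_read(answer)

    return reward, human_sees_answer

# Dummy stand-ins, just so the sketch runs:
reward, seen = counterfactual_episode(
    oracle_answer=lambda q: f"Answer to: {q}",
    check_accuracy=lambda q, a: 0.9,  # automated accuracy score in [0, 1]
    human_read=lambda a: print("Human reads:", a),
    question="What is the theory of everything?",
)
print("Reward:", reward, "| Answer shown to human:", seen)
```

The key design choice is that the only branch with a non-zero reward is the one where no human ever reads the answer, so accuracy is the only thing worth optimizing.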

With this design, we have an oracle AI that gives us answers to maximize a reward it gets in a counterfactual case — when we don’t see its answer. Since there’s always a chance its answer is checked, the oracle will be motivated to give useful answers (and not to manipulate humans).

This post is based on Good and Safe Uses of AI Oracles by Stuart Armstrong and Xavier O’Rourke.
