Stuart Russell wrote the textbook on AI safety. He explains how to keep it from spiraling out of control.

AI doesn’t have to be superintelligent to cause serious havoc.

One of the hardest parts of the news business is striking a balance between covering stories that seem important in the moment and covering stories that you know will truly matter in the future. And it’s hard because the most consequential things happening right now are often boring or difficult to explain.

Artificial intelligence is a good example of this sort of challenge. The ongoing revolution in AI is unfolding so quickly that it’s hard to keep up, even if you’re trying. GPT-4, for instance, was released in March of this year and it stunned almost everyone who used it. If this latest large language model is a sign of what’s coming, it’s easy to imagine all the ways it might change the world — and then there are all the ways it might change the world that we can’t imagine.

So what do we need to know about AI right now? What are the questions we should be asking? And how should we be preparing for whatever’s coming?

To get some answers, I invited Stuart Russell onto The Gray Area. Russell is a professor of computer science at UC Berkeley and the author of Human Compatible: Artificial Intelligence and the Problem of Control. He was one of the signatories of an open letter in March calling for a six-month pause on training AI systems more powerful than GPT-4. We discuss the risks and potential benefits of AI and whether he believes we can build AI systems that align with our values and interests.

Below is an excerpt of our conversation, edited for length and clarity. As always, there’s much more in the full podcast, so listen and follow The Gray Area on Apple Podcasts, Google Podcasts, Spotify, Stitcher, or wherever you find podcasts. New episodes drop every Monday.


Sean Illing

When you think about the state of AI at this moment, what feels most urgent to you? What excites you? What scares you?

Stuart Russell

I think it’s important to understand that almost nobody is saying that the state of AI right now is such that we have to worry about AI systems taking over the world, if you want to put it that way. They still exhibit many limitations, and at least the latest generation, the large language models like ChatGPT, don’t exhibit the kinds of decision-making capabilities and planning capabilities that you would need to take over the world.

You can try playing chess with them, for example. They’re pretty hopeless. They pretend well for a few moves, and then they’ll play a move that’s completely illegal because they haven’t actually learned the rules properly. There’s a lot of progress that we still need to make before we reach systems that are fully comparable to, or better than, a human mind.

The things people are concerned about right now with this technology [are things that] we already have. Disinformation would probably be number one: these systems can be directed to generate highly targeted, personalized propaganda, designed to convince an individual based on everything the system can find out about that person. And they could do that not just in a single email or blog post, but over several months. People are very worried about that being weaponized by nation states, by criminals, by unscrupulous politicians who would produce deep fakes of their opponents.

These risks are very real, and we’re already starting to see them happen, along with a bunch of other serious concerns. One that has surfaced recently is defamation: systems making up crimes, not because they were directed to, but just because they hallucinate. They say things that have no basis in truth, including defamatory statements about real individuals. There are a couple of lawsuits happening already.

Sean Illing

Are you comfortable referring to something like GPT-4 as intelligent? Or is that not quite the right word?

Stuart Russell

For normal conversation, it’s reasonable to say that it shows elements of intelligence. In fact, in a paper that Microsoft produced, a group of experts there spent several months with GPT-4 before it was released, trying to understand what it could do. The paper that they produced is called “Sparks of Artificial General Intelligence.” And that’s a pretty bold claim, because artificial general intelligence means the kind of AI that exceeds human capabilities in every dimension, the kind of AI that does take over the world.

According to them, we are creating the kind of AI that does take over the world. We have absolutely no idea how it works, and we are releasing it to hundreds of millions of people. We’re giving it credit cards, bank accounts, social media accounts. We’re doing everything we can to make sure that it can take over the world. That should give people something to think about.

Let me give you an example of that, which one of my colleagues sent me. He was using ChatGPT-3.5 and he asked it, “Which is bigger, an elephant or a cat?” And it says, “An elephant is bigger than a cat.” Then you ask, “Which is not bigger than the other, an elephant or a cat?” And it says, “Neither an elephant nor a cat is bigger than the other.” When you look at that second answer, you realize it can’t be answering the question with respect to some internal model where there are big elephants and little cats. But that means it wasn’t answering the first question with respect to such a model either. It wasn’t really answering either question in the sense that we think about answering questions, where we query an internal model of the world.

If I say, “Where’s your car?” you query your internal model of the world. You say, “It’s in the parking garage across the road.” That’s what we mean by answering a question. It’s really clear that in a real sense, these systems are not answering questions. They don’t seem to build a coherent internal model of the world.
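
A minimal sketch of that kind of consistency probe, for readers who want to try it themselves. The ask() helper here is a hypothetical stand-in for whatever chat interface you use; the canned replies simply replay the answers quoted above so the example runs on its own.

```python
# Consistency probe: ask two logically linked questions and check whether the
# answers could both come from a single internal model of the world.
# `ask` is a hypothetical placeholder; here it replays the interview's answers.

CANNED = {
    "Which is bigger, an elephant or a cat?":
        "An elephant is bigger than a cat.",
    "Which is not bigger than the other, an elephant or a cat?":
        "Neither an elephant nor a cat is bigger than the other.",
}

def ask(prompt: str) -> str:
    # In a real probe, call your chat-completion API of choice here.
    return CANNED[prompt]

def coherent(answer_1: str, answer_2: str) -> bool:
    # The questions are complements: if the first answer says the elephant is
    # bigger, the second cannot deny that either animal is bigger.
    says_elephant_bigger = "elephant is bigger" in answer_1.lower()
    denies_any_bigger = "neither" in answer_2.lower()
    return not (says_elephant_bigger and denies_any_bigger)

if __name__ == "__main__":
    a1 = ask("Which is bigger, an elephant or a cat?")
    a2 = ask("Which is not bigger than the other, an elephant or a cat?")
    print("Q1:", a1)
    print("Q2:", a2)
    print("Answers cohere with a single world model:", coherent(a1, a2))
```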

Sean Illing

You focus a lot on the “alignment problem” or the “control problem” and this question of whether AI might develop its own goals, separate and apart from the goals we program into it. How worried are you on this front?

Stuart Russell

The alignment problem is simpler than the one you described. You suggested the alignment problem is about systems developing their own goals, which are different from the ones we program into them. Actually, the original alignment problem is about systems pursuing the goals that we program into them, but the problem is we don’t know how to program the right goals.

We call this the King Midas problem. King Midas programmed his goal into the gods, that everything he touched should turn to gold, and the gods gave him exactly that. They carried out his objective, and then his food and his drink and his family all turned to gold, and he died in misery and starvation. There are many legends like this, and lots of cultures have very similar stories, where you get what you ask for and regret it, because you didn’t ask for the right thing.

What people have observed is that when you’ve got a sufficiently capable AI system and you give it even a very innocuous-sounding goal like “Could you fetch me a cup of coffee?”, it doesn’t take a genius to realize that if someone switches you off, you’re not going to succeed in fetching the coffee. As a logical sub-goal of the original goal, you’ve now got the goal of preventing yourself from being switched off, and possibly of taking other preemptive steps to avoid interference by human beings in the achievement of that goal.
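
To make that logic concrete, here is a toy calculation. The probabilities are illustrative assumptions, not anything from the conversation; the point is simply that for an agent whose only objective is the coffee, the plan that includes disabling its own off switch scores higher.

```python
# Toy numbers showing why "prevent my own shutdown" falls out as an
# instrumental sub-goal of a single fixed objective like "fetch the coffee."

P_FETCH_IF_RUNNING = 0.95      # chance of getting the coffee if not switched off
P_HUMAN_SWITCHES_OFF = 0.30    # chance a human interrupts the agent mid-task

def p_success(disable_off_switch: bool) -> float:
    p_interrupted = 0.0 if disable_off_switch else P_HUMAN_SWITCHES_OFF
    return (1.0 - p_interrupted) * P_FETCH_IF_RUNNING

if __name__ == "__main__":
    for plan in (False, True):
        label = "disable off switch first" if plan else "just fetch the coffee"
        print(f"{label}: P(success) = {p_success(plan):.3f}")
    # just fetch the coffee:    P(success) = 0.665
    # disable off switch first: P(success) = 0.950
    # With "fetch the coffee" as the only objective, the second plan wins.
```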

Sean Illing

You can have fun imagining all the ways that could go wrong—

Stuart Russell

Yep. There are many science fiction stories that do exactly that. Sometimes in the literature you’ll see the phrase, “Instrumental goals.” These are goals like self-preservation, like acquiring more power over the environment, acquiring money, acquiring more computing resources so you can do a better job of solving the goal that you’ve been given. These instrumental goals are just derived automatically from the original goal.

With a human being, if I say, “Fetch a cup of coffee,” it doesn’t mean “Fetch a cup of coffee” is now the only goal that you should care about and your entire life’s mission is to fetch that coffee. That’s not what we mean when we say it to a human being, but that’s how we have been building our AI systems for decades and decades. The objective that we put in is the objective of the system and nothing else. That’s fundamentally a mistake. We can’t build systems that way, because we cannot specify completely and correctly all the things that human beings care about, so that the system’s behavior is actually what we would really want to have happen.

My book is about a different way of building AI systems, so that they understand some things about what humans want, but they know there’s a bunch of other stuff that they don’t understand and they’re uncertain about. And that actually leads to systems that behave much more cautiously and usefully.

The other thing that you mentioned, the possibility that these machines would develop their own goals — obviously that would be much worse. It’s bad enough that we give them specified goals. If they’re able to develop their own goals, then there’s no reason to think that those would be aligned with our goals.

Sean Illing

I often hear people say that we’re still only talking about narrow AI at the moment. We’re not talking about artificial general intelligence, which you mentioned a minute ago, something that is actually self-teaching and can develop its own goals, and that’s the thing we really have to worry about. But I don’t know, that seems wrong to me, or it seems like it misunderstands how progress in AI works.

But setting that aside, it’s not like we need some AGI superintelligence to wreck our world. As you said a few minutes ago, when I start imagining all the havoc AI could cause merely through the creation and distribution of misinformation, it makes my head explode. Deep fake tech is already here, but it doesn’t feel pervasive enough yet to be a major concern. But the lines between fact and fiction are already suicidally blurred in our society, and the post-truth world I can imagine in that future is infinitely worse than our situation today.

Do you think we’re even close to ready for this?

Stuart Russell

No, I think if it’s not regulated, we are in for a huge amount of pain. As you say, in terms of still images, we are already at the point where they are indistinguishable from reality and they are coupled with the large language models. In other words, you can ask the language model to give you an image of anything you want and it’ll do it.

DALL-E and these other image generators are coupled to language models. You can already say, “Give me a 22-second video showing such and such and such and such,” and it will do it. It’s not great right now, but three years ago, face generation wasn’t great. There would often be weird things with the ears. Or it’d be like the same pair of earrings occurring over and over again. Just glitches. But those have been ironed out and now it’s pretty much perfect.

And that’s going to be the case for video very soon, if not already in the lab. You can say, “I need a video of Donald Trump receiving a suitcase full of cash from some mafioso,” and it’ll produce it for you. And it’ll be very difficult for anyone to prove that’s not real. We really need regulation. Just like we have regulation around counterfeit currency. We can now produce counterfeit currency that’s indistinguishable from the real thing to a non-expert shopkeeper, for example.

As a result, we have very stringent regulations and enormously long jail sentences for counterfeiting, and a lot of security around the designs. This idea that digital technology is completely safe and should be unregulated in all circumstances is just extremely outdated. There are two things people are proposing. One is that all AI-generated content should be labeled indelibly. There are methods called watermarking, which work for images and sound and video, where a mark is cryptographically encoded into the content and there’s really no way to pull it out.

You can recognize that the content was generated by such and such a model on such and such a date. And then the platforms, the social media platforms, have to make that absolutely apparent to the user. They could, for example, give you a filter saying, “I don’t want to see artificially generated content. Period.” Or if you do see it, it should have a big red box around it, maybe a red filter, so that it just doesn’t look like ordinary, natural video.

The second is that you also want ways of watermarking real video. When I have a video camera and I’m out there in the real world, it’s producing an indelible, cryptographically secure time stamp and geocode and all the rest, in a format that’s globally recognized, so that we know this is real video.
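
As a rough illustration of that second idea, and only as a sketch of the concept rather than any existing standard, a camera could sign a hash of each clip together with a timestamp and location using an ordinary public-key signature. This example uses the Ed25519 primitives from Python’s cryptography package; the record format is an assumption made up for the illustration.

```python
# Illustrative provenance sketch: the camera signs (hash of clip, time, place),
# so anyone holding the camera maker's public key can check the clip is real
# and unaltered since capture.

import hashlib
import json
import time

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def sign_capture(camera_key: Ed25519PrivateKey, video_bytes: bytes,
                 lat: float, lon: float) -> dict:
    record = {
        "sha256": hashlib.sha256(video_bytes).hexdigest(),
        "timestamp": int(time.time()),
        "lat": lat,
        "lon": lon,
    }
    payload = json.dumps(record, sort_keys=True).encode()
    return {"record": record, "signature": camera_key.sign(payload).hex()}

def verify_capture(public_key, video_bytes: bytes, signed: dict) -> bool:
    record = signed["record"]
    if record["sha256"] != hashlib.sha256(video_bytes).hexdigest():
        return False  # the video was altered after capture
    payload = json.dumps(record, sort_keys=True).encode()
    try:
        public_key.verify(bytes.fromhex(signed["signature"]), payload)
        return True
    except InvalidSignature:
        return False

if __name__ == "__main__":
    key = Ed25519PrivateKey.generate()          # would live inside the camera
    clip = b"...raw video bytes..."
    signed = sign_capture(key, clip, lat=37.87, lon=-122.26)
    print("untampered clip verifies:", verify_capture(key.public_key(), clip, signed))
    print("edited clip verifies:", verify_capture(key.public_key(), clip + b"x", signed))
```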

Those two things together, plus the regulations on the media platforms, would go a long way toward making us safe. Is that going to happen? I don’t know. An interesting article in the New York Times yesterday was saying, “On average, it takes [decades] for a regulation to catch up with technology.”

Sean Illing

A lot of these conversations about AI seem to imply that we actually have the power to control something that will be far more capable than we are. But I don’t see any reason in nature or history to believe that’s the case—

Stuart Russell

There are no historical examples and no examples in nature that I’m aware of where that happens. Although, actually, there are weird things that happen in nature. There are fungi that control the behavior of animals by literally getting into their nervous systems and causing them to behave in weird ways. Causing mice not to be afraid of cats, for example, so that the mouse gets eaten, and then the fungus gets into the brain of the cat. Some complicated lifecycle story like that.

There are these weird exceptions, but basically we don’t have a good model for how this might work. The way I think about it: if I go back to the question I posed at the beginning, “How do we maintain power forever over systems more powerful than ourselves?,” that sounds pretty hopeless. If instead you ask, “What’s a mathematical problem such that no matter how well the AI system solves it, we are guaranteed to be happy with the outcomes?,” that sounds maybe a little more possible.

We can make the AI system as intelligent as we want, but it’s constitutionally designed to solve a certain type of problem, and if it does that well, we are going to be happy with the results. The way I’m thinking about this is, as you said, not three laws, but really two. One, the AI system’s only objective is to further the interests of human beings. And the second principle is that it knows that it doesn’t know what those interests are.
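
A toy calculation can make those two principles concrete. The numbers here are illustrative assumptions: the agent’s only objective is the human’s payoff, but it is uncertain whether its proposed action helps or backfires, and deferring to a human who can switch it off comes out ahead of acting unilaterally.

```python
# Toy expected-value comparison for an agent whose objective is the human's
# payoff but which is uncertain what that payoff is.

P_ACTION_IS_GOOD = 0.9          # agent's belief that the proposed action helps
PAYOFF_IF_GOOD = 10.0           # human's payoff if the action is actually good
PAYOFF_IF_BAD = -100.0          # human's payoff if the action backfires
PAYOFF_DO_NOTHING = 0.0

def value_act_now() -> float:
    # Act immediately, with no chance to be corrected.
    return P_ACTION_IS_GOOD * PAYOFF_IF_GOOD + (1 - P_ACTION_IS_GOOD) * PAYOFF_IF_BAD

def value_defer() -> float:
    # Propose the action and let the human approve it or switch the agent off.
    # The human approves only when the action is good, so the downside is capped.
    return P_ACTION_IS_GOOD * PAYOFF_IF_GOOD + (1 - P_ACTION_IS_GOOD) * PAYOFF_DO_NOTHING

if __name__ == "__main__":
    print(f"act now: expected human payoff = {value_act_now():.1f}")   # -1.0
    print(f"defer:   expected human payoff = {value_defer():.1f}")     #  9.0
    # Because the agent maximizes the human's payoff while uncertain about it,
    # leaving the off switch in human hands is the better plan.
```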

Sean Illing

At what point does AI innovation become the most consequential event in human history?

Stuart Russell

Good question. I think certainly when we have something resembling AGI. AGI means AI systems that match or exceed human capabilities along all relevant dimensions, but because of the massive advantages that machines have in speed, memory, communication bandwidth, and intake bandwidth, there’s no doubt that they would very quickly far exceed human capabilities.

That would be the biggest event in human history, in my view, because it would in some sense switch civilization to an entirely different basis. Our civilization is based on our intelligence, and that would no longer be true. It might also be the end of civilization, if we don’t figure out how to control such a system.

To hear the rest of the conversation, click here, and be sure to follow The Gray Area on Apple Podcasts, Google Podcasts, Spotify, Stitcher, or wherever you listen to podcasts.
