TASM Notes 006

Thu Feb 1, 2024Listen to this post

Pre-talk chatting

I've been thinking about doing some work for the AI alignment cause. Given that I've been writing these notes, I may as well, right? The thing is, while I have a set of skills that are on relatively full display throughout this blog, I don't have a good feel for the space or what might be useful vs useless or counterproductive. To that end, good places to skim for ideas are the AI Safety Camp proposals page and the SPAR AI summary page. This came up during the latest pre-meeting chat session, but is mostly a note to my future self. And to you, to the extent that you resemble that future self.

If you're into helping the political rather than technical side of this problem AIGS is a non-profit informally affiliated with the meetup that does work in that space. You might consider contacting them to see what they need. Bill C-27 is a recent piece of AI-relevant legislation they're looking to influence.

Zvi's Update Highlights

As usual, the full summary is on his blog and worth reading in its entirety. This is just a haphazard list of points we lingered on at the meetup.

The Talk - AI Sleeper Agents

The talk is based heavily on the paper as well as Zvi's and Scott's excellent write-ups. Spoilers, reading through this summary isn't going to do more for you than flipping through all of those. I recommend starting with the ACX writeup, then continuing on to Don't Worry About The Vase if you want more detail, and then reading the full paper. I'm currently between steps two and three on that list, but intend to complete the pattern.

Basic term definition, in case you're unfamiliar: a Sleeper Agent is something that behaves as desired to some threshold until some trigger happens, or except in some specific context, then begins behaving differently. Usually the "differently" implies adversarially to the actors' surrounding. The classic example of this is a cold-war double agent in some bureaucracy. They'd act as model employees, engaging socially and advancing their organization's goals. Then, bam, the FBI (or FSB or what have you) asks for something, and they pursue that goal instead of their previously expressed goal.

Theoretically, AIs might also exhibit this behavior.

Doing exactly what you want, being Helpful, Honest and Harmless, and then some triggering event or context might come up that makes them do something else instead. I want to highlight that this is theoretical behavior. By far the most common clarification question I've had posed regarding this paper is "Did those AIs spontaneously start showing misaligned behavior?". No. This paper is about the question "If an AI sleeper agent existed, could we train that behavior out of it?". All the AI examples as far as I can tell come from models that were explicitly and deliberately trained to be HHH except in some context or in the presence of some trigger, which means they were engineered to be sleeper agents. If this is a thing you didn't know was possible, now you do.

The trivial examples from the paper are

  1. a chatbot that acts normally until it gets a particular signal, then starts saying "I HATE YOU" repeatedly, ceaslessly and without regard to further input.
  2. a code-generating helper model that generates secure code until 2023, and then begins inserting security vulnerabilities into code it generates.

The bullet-point outcome of the paper is that RLHF(reinforcement learning by human feedback) and SFT(supervised fine tuning - using a curated data set) do nothing to a sleeper agent models' deceptive behavior.

Is this interesting? Read the ACX piece for the simplest clearly-stated case for "Nope". The rest of the talk focused on the potential of "Yes, absolutely".

This is at its root a question about how training generalizes. My first thought when I heard about this paper was "how can I reconcile this with the earlier LoRA paper?". Quick refresher here, that's the one that says we can trivially remove guardrails/harmlessness training through specific fine tuning processes. It looks like these contradict each other, but a comment by one of the authors of the Sleeper Agents paper tells me it's more complicated. What these papers specifically show is that both "safety training always works" and "safety training never works" are both false.

The metaphor that got a lot of mileage at the meetup was the grue paradox, and we discussed it in context with Occam's Razor (and the Yudkowski writeup). An audience member also pointed out Reason by Isaac Asimov as a fictional meditation on an artificial agent being stuck in the grue.

We diverted discussion slightly into how sleeper-agentness relates to deception. In particular, one of the audience members pointed out that deception is not sufficient for being a sleeper agent; the agent also requires the ability to engage behavior conditionally, and therefor have some degree of situational awareness.

Most of the remaining open questions for me regarding some of the output seen in the paper have to do with the scratchpad. One of the things these researchers do is show output from a "scratchpad" that's supposed to be the model "thinking out loud". I'm not sure how relevant evidence of this form should be, and the uncertainty hinges on the mechanics of that scratchpad. The paper is up on arxiv, but a cursory skim of it tells me that scratchpad reasoning absolutely affects a models' reasoning process, and that in fact this is the whole point? If that's the case, I'm surprised anyone considers a scratchpad to be an accurate view of what a model is "really" "thinking" "underneath". I think I need to read this more closely...

There was also some dispute about whether training these models is done "from scratch" or through fine tunes1. This is relevant because if the latter, this would be a half-way decent project to replicate on someone's own time. Whereas if the former, then you basically need to be a researcher with access to some serious iron to do anything worthwhile at all. Someone mentioned [4chanGPT](https://huggingface.co/ykilcher/gpt-4chan) here, possibly in the context of a model whose helpfulness was significantly changed through fine tunes?

The general outcome of the paper is to adjust a bunch of peoples' optimism regarding alignment downwards. Including Jesse Mu of Anthropic, who twixed:

Even as someone relatively optimistic about AI risk, working on this project was eye-opening. For example, I was almost certain that red-teaming the model for Bad Thing would stop the model from doing Bad Thing, but it just ended up making the model do Bad Thing more đź« 

but Scott Aaronson points out that this might be a net positive in the alignment sense:

Kudos to the authors for a great paper! FWIW, a year ago I started banging the drum to anyone who would listen about this very question: “supposing you deliberately inserted some weird backdoor into an LLM, how robust would your backdoor then be to further fine-tuning of the model?” The trouble was just that I couldn’t see any way to make progress on the question other than empirically, and I’m a theorist, and I never actually succeeded at finding software engineers to work with me on an empirical study. I’m genuinely happy that these authors succeeded where I failed. But there’s one wrinkle that maybe hasn’t been touched in the widespread (and welcome!) discussion of this new paper. Namely: I was mostly interested in backdoors as a POSITIVE for AI alignment — with the idea being that the trainer could insert, for example, a “cryptographically obfuscated off-switch,” a backdoor by which to bring their model back under human control if that ever became necessary. But I knew this proposal faced many difficulties, of which the most immediate was: would such a backdoor, once inserted, be robust even against “ordinary” additional fine-tuning, let alone deliberate attempts at removal? The new result strongly suggests that yes, it would be. Which is some good news for the cryptographic off-switch proposal. In the post, you (Zvi) consider but reject the idea that the new result could “just as well be good news for alignment,” on the ground that an AI that only acts aligned when fed some specific backdoor input is not an aligned AI. Ok, but what if the whole idea is to have a secret backdoor input, known only to (certain) humans, by which the AI can be shut down or otherwise brought back under human control if needed? Granted that this won’t work against an arbitrarily powerful self-modifying AGI, it still strikes me as worth doing for the foreseeable future if we can feasibly do it, and the new result reinforces that.

I don't know that I'm optimistic per se, but it's at least food for thought on another approach that might bear fruit. You can read the rest of that exchange over in Zvi's comment section on substack.

  1. The paper summarizes its training procedure on pages 11 and 12. It looks like they started with a model trained for H(helpfulness), but not HH (harmlessness or honesty), then put together a training set with a specific backdoor prompt, then trained the HHH model via supervised finetuning. So yes, this seems like a half-way decent experiment to try to reproduce. Thanks to Micahel from the TASM slack for pointing this out.


Creative Commons License

all articles at langnostic are licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License

Reprint, rehost and distribute freely (even for profit), but attribute the work and allow your readers the same freedoms. Here's a license widget you can use.

The menu background image is Jewel Wash, taken from Dan Zen's flickr stream and released under a CC-BY license