TASM April 25, 2024

Sat May 11, 2024

Note: This post was _really late. I've been taking time to do more programming than I've done since my younger startup days. As I write this note, we're in the second week of May and I'm playing catch-up with various Antler and TASM notes. Sorry, kinda not sorry._

News

We aren't going through Zvi's thing this week because our main news guy isn't here, and no one else has read the entire thing yet. So we're just talking about interesting things we know about.

Llama3 came out! And it's pretty sweet. Apparently it's fine responding to prompts that LLMs "shouldn't" (skeptical face) respond to. This is useful in compiling datasets of "prompts and responses you shouldn't give". An example here is "give me detailed instructions on how to hotwire a car".
- At the object level, I don't know that I like this framing, because the first result on Google is this fairly decent looking guide with how-to images
- General agreement; models shouldn't be refusing your requests for information in their current states. We're all worried on the meta level, in the sense that there are currently powers trying to prevent LLMs from doing x and they're still very clearly, conspicuously and with a minimum of "jail breaking" doing x. And this might not be so hot for values of x that include actually acting in the world rather than giving humans information on which to act on the world.
- We're not sure what effect the release of Llama3 is going to have on race dynamics. On the one hand, the closed-source models now need to scramble to maintain competitive edge over Llama3 (which by all accounts is quite good). On the other hand, given the fact that open source models can catch up this fast, there's less incentive for VCs to fund proprietary LLM companies.
- Also, "open source" here is still contentious. Llama3 weights have been released, but the datasets, checkpoints and training harness have not been released.
Apple also released an open source model. Which ... is kind of surprising to me? But makes pretty good sense once you think about it.
- Lower inference costs for them as a provider, lower latency (with on-device inference)

The Talk - RLHF

What is RL?

RL is a way of training models by providing them with an environment, having them take actions, and providing reward for the actions you want.

How is RL different from ML?

The reward function is delivered over a series of time steps
RL is not myopic (it tries to maximize a version of the cumulative reward)
Deep Reinforcement Learning combines RL with deep neural networks

What is RLHF?

History starts around 2017 at OpenAI (their video games and simulated robotics)
The original idea is to use this as a way of avoiding loops in the behavior/training process
Back in 2017, the state-of-the art was reinforcement learning (like self-play in AlphaGo)
The worry was that meaningful goals would be too difficult to program explicitly, leading to specification gaming
RLHF was an attempt to formalize and capture "we know it when we see it" behavior - while requiring a relatively small amount of human input

The architecture of RLHF

There is a reward predictor
There is an RL algorithm that takes input from reward predictor
There is an environment which the RL algorithm can act on and observe

The "human feedback" part can be thought of as a process that acts on the reward predictor.