Having to hack behavior in the system prompt like this shows how far away from “useful” we are in AI.
You do not want to know how good current LLMs would be if you removed the thousands of negative prompts, a.k.a. guard rails.
Narrator: They would still be garbage.
Anthropic actually developed a system which, in the hands of the most capable…in narrow domains used conscientiously in a limited fashion with tremendous and constant risk mitigation……, is reportedly not garbage
Narrator: they ruined it
I doubt that. What evidence do you have?
Well, they’d be able to say how to make a bomb. Or kill yourself effectively. AI CEOs don’t even care what their systems can do. If some customers die, that’s okay to them; it shows how intelligent their AI is. And that’s a statement from one of the big AI CEOs.
I don’t think those are the categories where most people are finding LLMs frustrating. We keep being told human white collar work is on the precipice of being replaced, but LLMs continue to be really inconsistent. Failing to parrot easily retrievable info like how to build a legally restricted thing or off yourself isn’t what people are finding lacking; it’s that half the time it does something sorta correctly and the other half of the time it lies, fucks up, or fucks up and then lies about it.
I’m just parroting what John Oliver said on his last episode on Sunday.
This is demonstrably false, given you can download your own models and change the system prompts yourself.
That’s not how it works, as the guard rails are not just simple prompts that you can just delete.
Even with “abliteration”, you are basically modifying the model without a full retraining, but you also lose many capabilities at the same time.
So much for “demonstrably false”, when you obviously have never tried to uncensor any LLM.
The thread was literally about the prompt text.
The prompts are part of the training, you realize that? They are then inside the weights. Not just text files you can delete and you are good?
Just because an LLM reveals those negative prompts does not mean you can just remove them.
Do you genuinely know what you are talking about, or are you just here to ragebait?
…
anyways, yeah, the ais are trained to be more friendly, agreeable, and never take off the mask, but prompts are just text files you can delete??
if you want a real comparison, try one of the OLMo checkpoints before the fine-tuning?? i think??
Not to defend AI, but this is really foolish thinking. Configuration to make it useful proves it is not useful?
Because imagine spending billions on training it specifically to produce useful answers and then not even trusting it to not randomly start answering with something completely unrelated.
No, controlling the behavior by providing a hand-tuned list of no-nos shows that we have no idea how to make an AI stay on task. AI accuracy drops dramatically as context size increases, and every word in the system prompt pollutes that context.
It’s also concerning because prompt hacking is an inherently reactive measure. It doesn’t fix the fundamental focus problem in the architecture, which leaves any number of other potential behavioral quirks wide open.
Effectively what I’m trying to say is that this is not a scalable way to guide an LLM into the correct behavior, and it will backfire if companies keep relying on it.
This is how you make an AI stay on task. This is how literally anything with any semblance of uncertainty is programmed - you provide guide rails to keep the behaviour confined to whatever you want to limit it to. It’s just we don’t normally use plain English to do it.
Even a Koopa Troopa in Super Mario Bros. has guide rails dictating the limits of its behaviour. They turn when they reach blocks. Red ones turn when they reach a ledge. That’s for something that literally just moves across a screen horizontally. ChatGPT is trained to output text for a nearly unlimited number of topics - it’s insane to think that you wouldn’t need a fair number of guide rails.
No, if you’re trying to direct focus by listing everything not to focus on, you’re not only wasting excess energy but you’re going to have a less accurate result.
“Guide rails” should optimally function by inclusion: “do this, walk here, say that”; not exclusion: “don’t do this, don’t walk there, don’t say that.”
Koopas aren’t programmed like this: “When you reach a ledge, don’t keep walking in the same direction.” They’re programmed like this: “When you reach the ledge, turn around.” It’s a positive or affirmative statement, not a negative one.
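To make the “affirmative rule” point concrete, here is a toy sketch in Python. This is not the actual game code; the `Koopa` class, the `turns_at_ledges` flag, and the tile list are all made up for illustration, and the edge of the map stands in for a block.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Koopa:
    """Hypothetical toy model of a walking enemy on a 1-D row of tiles."""
    x: int
    direction: int = -1            # -1 = walking left, +1 = walking right
    turns_at_ledges: bool = False  # True for the red variety

    def step(self, ground: Sequence[bool]) -> None:
        """Advance one tile; ground[i] says whether tile i has floor."""
        nxt = self.x + self.direction
        blocked = nxt < 0 or nxt >= len(ground)   # map edge acts as a block
        at_ledge = not blocked and not ground[nxt]
        if blocked or (self.turns_at_ledges and at_ledge):
            # Affirmative rule: "when you reach X, turn around".
            self.direction = -self.direction
        else:
            self.x = nxt
```

The rule says what to do at the ledge; there is no list of directions the Koopa must not walk in.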
If someone prompts an LLM: “Give me a recipe for brownies,” it shouldn’t run through a whole list of “Let’s see, I’m not supposed to talk about goblins, pigeons, trolls… etc.” It should go “brownie recipe, let’s see, so we’re gonna need milk, eggs, flour, cocoa, etc…”
Granted, using an LLM for a baking recipe is idiotic because baking is a deterministic process which requires accuracy. But you get the picture.
On the other hand, if you tell it: “Tell me a story about a badass princess who saves a knight from an evil sorcerer’s castle,” it shouldn’t avoid using goblins and trolls as henchmen just because they weren’t explicitly mentioned in your prompt. That’s silly.
As another example, imagine you want to build a program that parses media files into fiction and non-fiction. You can’t just do this with a list of keywords. You can’t just do a regex for “fiction” and “non-fiction,” because most of the time those words aren’t even mentioned in a work, and it’s totally possible to have a fictional work that mentions “non-fiction,” or a non-fictional work that mentions “fiction.”
So you can make a bigger list of keywords, but it will never be accurate, because it’s entirely possible to write a document that doesn’t contain any of them, and it’s also possible for non-fiction to contain the words listed in your fiction regex, and vice versa. It’s just not an accurate way to do this.
Far better would be to extract metadata. Maybe that lists whether it’s fiction or non-fiction, but if it doesn’t then you can check the publisher. Many publishers are exclusively one or the other. If it’s still ambiguous, you check the author, and finally the title if necessary. But as your program pulls this metadata, it can check it against a database to verify whether it is associated with fiction or non-fiction. This is far more accurate than simple keyword recognition.
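A rough sketch of that fallback chain, in Python. The lookup tables are hypothetical stand-ins for a real database, and every publisher, author, and title in them is invented for illustration.

```python
from typing import Optional

# Hypothetical reference tables; a real system would query a catalogue database.
PUBLISHER_GENRE = {
    "Example Fantasy Press": "fiction",
    "Example Academic Press": "non-fiction",
}
AUTHOR_GENRE = {"A. Novelist": "fiction"}
TITLE_GENRE = {"A Field Guide to Example Birds": "non-fiction"}


def classify(meta: dict) -> Optional[str]:
    """Return 'fiction', 'non-fiction', or None if the metadata is still ambiguous."""
    # 1. Trust an explicit genre field when the metadata has one.
    genre = meta.get("genre")
    if genre in ("fiction", "non-fiction"):
        return genre
    # 2. Otherwise fall back to publisher, then author, then title.
    for key, table in (
        ("publisher", PUBLISHER_GENRE),
        ("author", AUTHOR_GENRE),
        ("title", TITLE_GENRE),
    ):
        value = meta.get(key)
        if value in table:
            return table[value]
    return None  # ambiguous: better to flag it than to guess
```

Each step is a positive lookup against known data rather than a scan for forbidden or telltale words.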
An LLM doesn’t work like a programmatic script in that way, though. It does, however, multiply various matrices in order to assess how relevant each candidate next token is to the given context. This is somewhat comparable to cross-referencing multiple databases. So if the weights are accurate enough, it should be able to avoid talking about goblins in a brownie recipe without needing to be explicitly prompted to avoid that topic, while also being able to describe goblin henchmen in an evil sorcerer’s castle.
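As a very loose illustration of scoring the next token against the context, here is a toy in Python. It is a drastic simplification, not how a real transformer is built: the two-dimensional “embeddings” and context vectors are invented numbers, and a real model has many layers between the context and the final scores.

```python
import math

# Made-up 2-d vectors: axis 0 is roughly "baking", axis 1 roughly "fantasy".
# These are illustrative numbers, not real model weights.
VOCAB = {
    "flour": [2.0, 0.0],
    "cocoa": [1.5, 0.1],
    "goblin": [0.0, 2.0],
}


def next_token_probs(context_vec):
    """Score each token by dot product with the context, then apply softmax."""
    logits = {
        tok: sum(c * e for c, e in zip(context_vec, emb))
        for tok, emb in VOCAB.items()
    }
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - top) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: v / total for tok, v in exps.items()}


brownie = next_token_probs([3.0, 0.0])  # context that "looks like" a recipe
castle = next_token_probs([0.0, 3.0])   # context that "looks like" a fantasy story
```

With these numbers, “goblin” gets almost no probability in the brownie context and almost all of it in the castle context, and there is no exclusion list anywhere.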
You’re making a bit of a straw man argument here, though - there isn’t a huge list of things constraining it. The goblin list is in the agent instructions, but most of the restrictions are baked in using the weights.
The goblins etc were added to the list to address a specific problem. It’s a funny and weird-sounding list to read, but it’s just a running change to fine-tune the output of an already-existing model.
It’s not a strawman. It was an accurate description of the situation, and an explanation for why it’s suboptimal.
Have you seen the full list of background instructions? Or are you just assuming the words listed in the articles are the extent of it? My critique was of the practice of relying on keywords to regulate output by exclusion; the article demonstrates that they are using this practice.
The weights aren’t restrictive. That’s fundamentally not how they operate. They don’t identify specific items to exclude. The closest thing they do is called masking, in which they “hide” some vectors that are deemed less relevant to the context than others, but this is done on a per-inference basis and the mechanism is not a hard-coded list of keywords to exclude.
The problem is overfitting or underfitting to training data, so that the model hallucinates an output with a string of words that doesn’t belong, such as mentioning goblins in a brownie recipe. Excluding “goblin” as a keyword does not address the issue. It only appears to at a very superficial glance, but the problem will recur like whack-a-mole until you’ve excluded so many keywords that your model is worthless, or it overwhelms the context window and dilutes the aspects of the prompt that are actually relevant.
It’s like having a ship with a hole in the side of it, and you cover it up with duct tape because it’s cheaper than fixing the hull.
Fine-tuning is a different process. Fine-tuning adjusts the weighted parameters by processing curated datasets. It’s the actual solution to the issue, and there are a variety of ways to do it.
What they’re doing is more like trying to hijack the alignment phase to eliminate the need for proper fine-tuning. Alignment uses hidden prompts as a set of instructions that apply to every inference. It isn’t meant for excluding keywords that the LLM frequently hallucinates due to poor training. It’s meant for putting guardrails on behavior with certain red lines, e.g. “Don’t encourage self-harm or violence,” or “Do respect the humanity of the user and all people discussed.” Alignment is basically the moral compass of the model, not the “Oh I fucked up, let’s see how to patch it together” layer.
First of all, I’ll own my bad - I used the term “fine-tune” in a general sense. I didn’t mean to muddy the waters and I wasn’t referring to the fine-tuning stage of the neural network.
You’re right about it being a cheaper fix than retraining the model, with the duct tape boat analogy - this is exactly what I’ve been saying. The goblin lines have been added to address a specific issue that was noticed with the latest release - it’s a stop-gap.
And yes I’ve seen the full list of background instructions - the first thing I did after reading the article was to check on GitHub to confirm that it’s true because it sounded so bizarre.
There isn’t a huge list of topics it shouldn’t cover. There are a lot of instructions about how the agent should behave, but there is not a massive list of keywords / topics to avoid as you’re claiming.
Guide rails are fine, if they aren’t made out of tissue paper. You should engineer them correctly.
By “made out of tissue paper”, I assume you mean written in a list in English?
These lines were added to the agent instructions to address a specific weird behaviour that had been observed in Codex’s output. How would you have done it correctly?
Filter the output to remove all instances of raccoons? What if the project is actually about raccoons?
Run an adversarial LLM specifically to double check and, if necessary, correct instances of raccoons? That uses twice the power and still needs to be defined in text.
Train a new model with an anti-raccoon bias? I’d be surprised if they didn’t for the next iteration, but it takes time.
The reality is that for something this daft, the immediate fix is this.
Biases against outputs that might encourage self-harm, murder, etc are baked into the models during training nowadays. These guardrails are there in the neural network, not as text or instructions, but part of the structure itself.
The plain text agent instructions just give the different models a push in the direction that they want. Apparently it was mentioning raccoons in unexpected contexts, so for now they just told it not to anymore.
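For what it’s worth, the first option (filtering the output) is easy to sketch, and the sketch shows the problem with it. This is a hypothetical Python post-filter, not anything from the actual agent; the function name and the sentence-splitting rule are assumptions.

```python
import re


def strip_raccoons(text: str) -> str:
    """Naive output filter: drop every sentence that mentions raccoons."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept = [
        s for s in sentences
        if not re.search(r"\bracc?oons?\b", s, re.IGNORECASE)
    ]
    return " ".join(kept)
```

On “Here is the fix. A raccoon approves.” it does what you want and keeps only the first sentence. On “This tracker counts raccoons per night.” it returns an empty string, which is exactly the “what if the project is actually about raccoons?” failure.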
If you need to define everything that isn’t relevant to a conversation with a list of keywords, and generalize it to all conversations, except for those which explicitly qualify a keyword as relevant, then you’re fighting a losing battle, you’re gonna have an ass product, and you’re certainly not building anything with the potential to develop consciousness, as they love to claim with all this “AGI” talk.