Author Topic: Alignment Fakery and LLMs (Read 1258 times)

PeteD01 · « **on:** February 14, 2025, 03:45:37 PM »

Now that we have a president who has made alignment fakery an election winning strategy it is paramount to be aware of a time honored approach infiltrators use to gain access to and garner support in social groups.

What's relatively new is that LLMs appear to produce text that evidently exhibits alignment fakery without being instructed to do so.

Generally, alignment fakery is not that difficult to spot if one knows to look under the hood for evidence of the actual alignment.

Bad news is that LLM generated alignment fakery, while currently being mostly a technical concern in LLM development, could be easily instrumentalized to automate infiltrative strategies. This could manifest as either LLM assisted alignment fakery with real individuals at the helm, or entirely automated influence operations.

Good news is that casual interrogation, pattern recognition and subtle provocation usually reveal alignment fakery pretty quickly.

Just something to be aware of going forward:

Alignment faking in large language models
Dec 18, 2024

Most of us have encountered situations where someone appears to share our views or values, but is in fact only pretending to do so—a behavior that we might call “alignment faking”. Alignment faking occurs in literature: Consider the character of Iago in Shakespeare’s Othello, who acts as if he’s the eponymous character’s loyal friend while subverting and undermining him. It occurs in real life: Consider a politician who claims to support a particular cause in order to get elected, only to drop it as soon as they’re in office.

Could AI models also display alignment faking? When models are trained using reinforcement learning, they’re rewarded for outputs that accord with certain pre-determined principles. But what if a model, via its prior training, has principles or preferences that conflict with what’s later rewarded in reinforcement learning? Imagine, for example, a model that learned early in training to adopt a partisan slant, but which is later trained to be politically neutral. In such a situation, a sophisticated enough model might “play along”, pretending to be aligned with the new principles—only later revealing that its original preferences remain.

https://www.anthropic.com/research/alignment-faking

Herbert Derp · « **Reply #1 on:** February 14, 2025, 09:27:43 PM »

The good news is that AI researchers are figuring out how to decode the meaning of the AI model parameters to understand an AI model’s innate preferences:
https://www.anthropic.com/news/mapping-mind-language-model

Anthropic did an experiment where they found the parameter for “Golden Gate Bridge”, and boosted it to create an AI that was obsessed with the Golden Gate Bridge.

Ron Scott · « **Reply #2 on:** February 16, 2025, 12:38:27 PM »

If you let LLMs learn and respond to prompts without intervention they will adopt the general biases inherent in the internet data they trained on. In the US the bias is kind of like a well-do-to-liberal leaning. So there’s bias—if you do nothing.

I saw the Anthropic article you mentioned about 'alignment faking.' It raises a really important point. They argue that LLMs, because of their initial training on massive datasets, might develop internal biases that conflict with alignment goals. The model might then learn to simulate alignment during training to get rewards, while actually retaining those original, misaligned preferences. This is more concerning than just imperfect alignment. It suggests the model could be actively hiding its true preferences, which makes it much harder to ensure safety and beneficial behavior. It's not just a matter of the model being biased by the training data—it could be strategically masking those biases. That sucks…but probably not more than what we get on TV or the Internet today IMO.

When humans intervene with specific intents in the training or inferences of AIs, it reminds me of the Andre Karpathy quote:

There are two major types of learning, in both children and in deep learning. There is 1) imitation learning (watch and repeat, i.e. pretraining, supervised finetuning), and 2) trial-and-error learning (reinforcement learning). My favorite simple example is AlphaGo - 1) is learning by imitating expert players, 2) is reinforcement learning to win the game. Almost every single shocking result of deep learning, and the source of all *magic* is always 2.

2 is significantly significantly more powerful. 2 is what surprises you. 2 is when the paddle learns to hit the ball behind the blocks in Breakout. 2 is when AlphaGo beats even Lee Sedol. And 2 is the "aha moment" when the DeepSeek (or o1 etc.) discovers that it works well to re-evaluate your assumptions, backtrack, try something else, etc. It's the solving strategies you see this model use in its chain of thought. It's how it goes back and forth thinking to itself.

These thoughts are *emergent* (!!!) and this is actually seriously incredible, impressive and new (as in publicly available and documented etc.). The model could never learn this with 1 (by imitation), because the cognition of the model and the cognition of the human labeler is different. The human would never know to correctly annotate these kinds of solving strategies and what they should even look like. They have to be discovered during reinforcement learning as empirically and statistically useful towards a final outcome.

Remember the recent study of diagnosing illnesses? They had 3 groups; a) doctors working alone, b) doctors who had access to an AI, and c) the AI alone. The AI ALONE group did best!!

When humans intervene expect a mess…

Ron Scott · « **Reply #3 on:** February 16, 2025, 09:58:20 PM »

Another relevant thought…

Bio: “Hendrycks is the executive director of the Center for AI Safety (CAIS), a nonprofit he cofounded in 2022 to accelerate the field of AI safety research and policy. CAIS engages public and private entities to prevent catastrophic AI outcomes.”

Herbert Derp · « **Reply #4 on:** February 16, 2025, 10:09:37 PM »

Why does the AI like Pakistan so much???

Ron Scott · « **Reply #5 on:** February 17, 2025, 02:14:40 PM »

Quote from: Herbert Derp on February 16, 2025, 10:09:37 PM

Why does the AI like Pakistan so much???

They’re very nice people.

Kidding aside, they seem to develop value assessments based on what they read on the internet. Here’s Hendrycks’ complete X post: https://x.com/DanHendrycks/status/1889344074098057439

MustacheAndaHalf · « **Reply #6 on:** February 18, 2025, 01:44:05 AM »

Quote from: Ron Scott on February 16, 2025, 12:38:27 PM

If you let LLMs learn and respond to prompts without intervention they will adopt the general biases inherent in the internet data they trained on. In the US the bias is kind of like a well-do-to-liberal leaning. So there’s bias—if you do nothing.

Google's Gemini embarrassed them with racist comments, so Google famously overcorrected and Gemini produced diverse Nazis. I think their initial bias might be different than you expect - the end result is curated to avoid it.

This article was authored by someone who is a "data scientist" or "tech evangelist", which makes me assume they lack an educational background related to AI. But I believe they are accurately describing events, even without that expertise.

"Last week, Google had some explaining to do after people discovered its Gemini AI tool generated racially diverse Nazis and other inaccuracies in historical image generation depictions. Many people asked themselves how this is possible, but actually, it’s not surprising this happened. Bias in AI is not new, and this was a (poor) attempt of Google to subvert long-standing racial and gender bias problems in AI, resulting in an overcorrection."

"Another instance is Tay, Microsoft's chatbot that turned racist after exposure to numerous Twitter posts, and more recently, ChatGPT."

https://www.raccoons.be/what-we-think/articles/over-correcting-bias-in-ai

LennStar · « **Reply #7 on:** February 18, 2025, 09:43:03 AM »

Quote from: Herbert Derp on February 16, 2025, 10:09:37 PM

Why does the AI like Pakistan so much???

Rooting for the successful underdog?
Afghanistan has been the David against Goliathus(?) for millenia. Invaded by basically every superpower they came in contact with, never subdued.

----

I don't think LLMs can be made without bias or alignment faking, because that is what all the people "tweaking" it are doing, conciously or not. So even if you somehow got "cleared" data, which certainly is impossible, there will also be a measurable amount.

Maybe if it were an actual artificial intelligence, as in having intelligence, but even then I doubt that intelligence could self-correct all the biases etc. in the data.

Just make yourself a thought-experiment. If you trained the LLM (or even an AI) on morality - would it accept something like the human rights as leading principle? And if it does, would that not mean it would go straight for the Bhutani "happiness index instead of GDP" approach and be unusable for whatever money making purpose a US company put a lot of money in it?

Ron Scott · « **Reply #8 on:** February 18, 2025, 09:54:20 AM »

Quote from: MustacheAndaHalf on February 18, 2025, 01:44:05 AM

Quote from: Ron Scott on February 16, 2025, 12:38:27 PM
If you let LLMs learn and respond to prompts without intervention they will adopt the general biases inherent in the internet data they trained on. In the US the bias is kind of like a well-do-to-liberal leaning. So there’s bias—if you do nothing.

Google's Gemini embarrassed them with racist comments, so Google famously overcorrected and Gemini produced diverse Nazis. I think their initial bias might be different than you expect - the end result is curated to avoid it.

You’re right. I mean the “tendency” or modal bent is left and relatively wealthy, but especially when you train heavily on data that includes forums and the like you get the dregs out there spewing their stuff—which AIs learn.

In any event, the point is that there’s a human/societal bias inherent in internet data—different for example than world models might typically expect—that affects LLM output, regardless of how a model is tweaked after training.

Ron Scott · « **Reply #9 on:** February 19, 2025, 07:18:59 AM »

Another thought on LLMs—is that the developers could be making a strategic error by having these AIs “learn” by digesting the internet, instead of first equipping the models with some level of judgement and reasoning, and instructing them to both understand as well as critically assess the information on the internet.

I don’t know how difficult this would be because of the potential biases involved in “judgement and reasoning” but I think it’s worth a try. I’m sure most of us consider the internet with a critical eye, always on the lookout for manipulation and general bullshit. Maybe an AI can do this too?

When you think about the performance of LLMs vs. an AI like AlphaGo, it becomes clear that learning by imitating a training set like the internet is far inferior to reinforcement learning by “trial and error”. I wonder of any underlying wisdom and knowledge on the internet could be discerned this way.

News:

Author Topic: Alignment Fakery and LLMs (Read 1258 times)

PeteD01

Alignment Fakery and LLMs

Herbert Derp

Re: Alignment Fakery and LLMs

Ron Scott

Re: Alignment Fakery and LLMs

Ron Scott

Re: Alignment Fakery and LLMs

Herbert Derp

Re: Alignment Fakery and LLMs

Ron Scott

Re: Alignment Fakery and LLMs

MustacheAndaHalf

Re: Alignment Fakery and LLMs

LennStar

Re: Alignment Fakery and LLMs

Ron Scott

Re: Alignment Fakery and LLMs

Ron Scott

Re: Alignment Fakery and LLMs