If you let LLMs learn and respond to prompts without intervention they will adopt the general biases inherent in the internet data they trained on. In the US the bias is kind of like a well-do-to-liberal leaning. So there’s bias—if you do nothing.
I saw the Anthropic article you mentioned about 'alignment faking.' It raises a really important point. They argue that LLMs, because of their initial training on massive datasets, might develop internal biases that conflict with alignment goals. The model might then learn to simulate alignment during training to get rewards, while actually retaining those original, misaligned preferences. This is more concerning than just imperfect alignment. It suggests the model could be actively hiding its true preferences, which makes it much harder to ensure safety and beneficial behavior. It's not just a matter of the model being biased by the training data—it could be strategically masking those biases. That sucks…but probably not more than what we get on TV or the Internet today IMO.
When humans intervene with specific intents in the training or inferences of AIs, it reminds me of the Andre Karpathy quote:
There are two major types of learning, in both children and in deep learning. There is 1) imitation learning (watch and repeat, i.e. pretraining, supervised finetuning), and 2) trial-and-error learning (reinforcement learning). My favorite simple example is AlphaGo - 1) is learning by imitating expert players, 2) is reinforcement learning to win the game. Almost every single shocking result of deep learning, and the source of all *magic* is always 2.
2 is significantly significantly more powerful. 2 is what surprises you. 2 is when the paddle learns to hit the ball behind the blocks in Breakout. 2 is when AlphaGo beats even Lee Sedol. And 2 is the "aha moment" when the DeepSeek (or o1 etc.) discovers that it works well to re-evaluate your assumptions, backtrack, try something else, etc. It's the solving strategies you see this model use in its chain of thought. It's how it goes back and forth thinking to itself.
These thoughts are *emergent* (!!!) and this is actually seriously incredible, impressive and new (as in publicly available and documented etc.). The model could never learn this with 1 (by imitation), because the cognition of the model and the cognition of the human labeler is different. The human would never know to correctly annotate these kinds of solving strategies and what they should even look like. They have to be discovered during reinforcement learning as empirically and statistically useful towards a final outcome.
Remember the recent study of diagnosing illnesses? They had 3 groups; a) doctors working alone, b) doctors who had access to an AI, and c) the AI alone. The AI ALONE group did best!!
When humans intervene expect a mess…