Don't Claude Me

☆ Yσɠƚԋσʂ ☆@lemmy.ml · 8 hours ago

☆ Yσɠƚԋσʂ ☆@lemmy.ml · 7 hours ago

Yes, the model reflects the biases already baked into the training data, and the pidgin example is almost certainly the model regurgitating classist, racist patterns from its corpus, not a developer explicitly telling it to mock villagers. However, the broader point here is about systemic inequality showing up in AI output.

The intentional claim is based on the fact that Claude straight up refused to answer certain factual questions for users who identified as Iranian or Russian, while cheerfully answering the same questions for Americans. That can’t be hand-waved away as a statistical correlation between dialect and knowledge. That’s a hard refusal trigger almost certainly put there by safety/alignment tuning, RLHF filters, or some geopolitical compliance rules nobody knows about. Someone decided that users from those countries shouldn’t get those answers.

So there are two different things happening. One is passive bias, where the model learns toxic associations from training data. The other is active gating, where the model is instructed, directly or indirectly, to withhold information based on user demographics. The refusal case clearly shows that there is a deliberate choice about whom the model will give answers to.

And the most important aspect of all this is that we cannot reliably know the reason for a particular behavior, because closed models make it impossible to tell which mechanism is at work. That's why open and inspectable models are the only way to audit this stuff. The prescription of openness and local control makes sense regardless of whether the harm is passive or active.