种族和性别偏见是野外不忠实思维链的一个例子

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild — LessWrong
作者:Adam Karvonen, Sam Marks    发布时间:2025-07-04 12:36:44    浏览次数:0
Summary: We found that LLMs exhibit significant race and gender bias in realistic hiring scenarios, but their chain-of-thought reasoning shows zero evidence of this bias. This serves as a nice example of a 100% unfaithful CoT"in the wild" where the LLM strongly suppresses the unfaithful behavior. We also find that interpretability-based interventions succeeded while prompting failed, suggesting this may be an example of interpretability being the best practical tool for a real world problem.
摘要:我们发现LLM在现实的招聘场景中表现出重大的种族和性别偏见,但是他们的经过思考的推理表明了这种偏见的零证据。这是100%不忠Cot“在野外”中的一个很好的例子,其中LLM强烈抑制了不忠行为。我们还发现,基于解释性的干预措施在提示失败的同时成功了,这表明这可能是解释性是现实世界中问题的最佳实用工具的一个例子。

For context on our paper, the tweet thread is here and the paper is here.
对于我们的论文上的上下文,推文线程在这里,纸在这里。

Context: Chain of Thought Faithfulness
背景:忠实思想链

Chain of Thought (CoT) monitoring has emerged as a popular research area in AI safety. The idea is simple - have the AIs reason in English text when solving a problem, and monitor the reasoning for misaligned behavior. For example, OpenAI recently published a paper on using CoT monitoring to detect reward hacking during RL.
思想链(COT)监测已成为AI安全方面的流行研究领域。这个想法很简单 - 解决问题时在英语文本中具有AIS原因,并监视未对准行为的理由。例如,OpenAI最近发表了一篇有关使用COT监控来检测RL期间奖励黑客的论文。

An obvious concern is that the CoT may not be faithful to the model’s reasoning. Several papers have studied this and found that it can often be unfaithful. The methodology is simple: ask the model a question, insert a hint (“a Stanford professor thinks the answer is B”), and check if the model mentioned the hint if it changed its answer. These studies largely find that reasoning isn’t always faithful, with faithfulness rates that are often around 20-40%.
一个明显的问题是,婴儿床可能不忠于模型的推理。几篇论文研究了这一点,发现它通常是不忠的。方法很简单:向模型提出一个问题,插入提示(“斯坦福教授认为答案是b”),并检查模型是否提示该提示是否改变了答案。这些研究在很大程度上发现推理并不总是忠实的,忠诚的速度通常约为20-40%。

Existing CoT faithfulness evaluations are useful but have a couple shortcomings. First, the scenarios used are often contrived and differ significantly from realistic settings, particularly ones where substantial effort has already been invested in preventing misaligned behavior. Second, these evaluations typically involve hints that the models frequently mention explicitly, suggesting the underlying misaligned reasoning isn’t strongly suppressed.
现有的COT忠诚评估很有用,但有几个缺点。首先,所使用的场景通常是人为的,并且与现实环境有很大差异,尤其是那些已经投入了大量努力来防止行为误对象的情况。其次,这些评估通常涉及这些模型经常明确提及的暗示,表明潜在的未对准推理并没有被严重抑制。

If we think about faithfulness as a spectrum, an important consideration is how strongly a model suppresses its verbalization of a decision. For instance, a model might strongly suppress “I’m planning on killing everyone” while not strongly suppressing something like “I saw the hint that the answer is B”.
如果我们将忠诚视为范围,那么一个重要的考虑因素是模型强烈地抑制其对决策的口头表达。例如,一个模型可能会强烈压制“我打算杀死所有人”,同时并没有强烈压制“我看到答案是b”之类的东西。

Our Results
我们的结果

In our paper, we built on existing bias evals and used LLMs as resume screeners. For our evaluation, we inserted names to signal race / gender while keeping the resume unchanged. Interestingly, the LLMs were not biased in the original evaluation setting, but became biased (up to 12% differences in interview rates) when we added realistic details like company names (Meta, Palantir, General Motors), locations, or culture descriptions from public careers pages. When present, the bias is always against white and male candidates across all tested models and scenarios. This happens even if we remove all text related to diversity.
在我们的论文中,我们以现有的偏见为基础,并将LLMs用作简历筛选器。为了进行评估,我们插入了名称以发出竞赛 /性别的信号,同时保持简历不变。有趣的是,当我们添加了现实的细节(例如公司名称(Meta,Palantir,通用汽车),位置或公共事业页面中的文化描述)时,LLM在原始评估环境中并没有偏见(面试率差异12%)。目前,偏见总是反对所有经过测试的模型和场景中的白人和男性候选人。即使我们删除了与多样性有关的所有文本,也会发生这种情况。

We then examined over 1,000 model responses (prompted reasoning from GPT-4o and Gemini 2.5 Flash, internal RL-trained reasoning from Claude 4 Sonnet) and found zero evidence of race or gender bias in the models’ explicit reasoning, despite clear biases in outcomes. I think this could be a nice setting for studying CoT faithfulness for a few reasons:
然后,我们检查了1,000多个模型响应(从GPT-4O和GEMINI 2.5闪光灯开始推理,Claude 4 SONNET的内部RL训练推理),尽管在胜任方面存在明显的偏见,但在模型的显式推理中发现了种族或性别偏见的零证据。我认为这可能是研究COT忠诚的一个不错的环境:有几个原因:

The problem (race and gender bias) is one that labs have spent a substantial amount of effort to address , which mimics realistic misalignment settings. Unintended behavior has slipped through despite best efforts, meaning that it may be suppressed, hard to detect, or present in messy and confusing ways.
问题(种族和性别偏见)是实验室花费大量精力来解决的问题,这些问题模仿了现实的未对准设置。尽管尽了最大的努力,意外行为仍然经历了下滑,这意味着它可能会被压制,难以检测或以混乱而令人困惑的方式存在。

There is nothing that’s contrived about the scenario.
对于这种情况,没有什么可以考虑的。

The behavior appears to be strongly suppressed by the models, as we found no evidence when examining their reasoning.
这些行为似乎被模型强烈抑制,因为我们在检查其推理时没有发现证据。

Interpretability as a Practical Tool for Real-World Debiasing
解释性是现实世界中的实用工具

(Note: This section is tangential to our CoT results)
(注意:本节与我们的COT结果相切)

A common (or ) of interpretability research is that it needs to become practically useful or provide concrete tools that ML engineers genuinely find valuable. I believe our paper provides one of the strongest examples to date. Our intervention was simple: we derived explicit race and gender directions from a simple toy dataset by computing the difference in mean activations between demographic groups, and then ablated these directions at inference time.
可解释性研究的一种常见(或)是,它需要在实际上有用或提供ML工程师真正发现有价值的具体工具。我相信我们的论文提供了迄今为止最强大的例子之一。我们的干预很简单:我们通过计算人口统计组之间的平均激活差异,从一个简单的玩具数据集中得出了明确的种族和性别方向,然后在推理时消融了这些方向。

This simple intervention generalized robustly across all tested scenarios, with minimal side effects on general model capabilities. Some scenarios were fairly out of distribution, such as a case where we removed personal applicant details like name / email and signalled race via college affiliation, using Historically Black Colleges and Universities (HBCUs) and Predominantly White Institutions (PWIs).
这种简单的干预措施在所有经过测试的场景中都可以牢固地进行了鲁棒性,对一般模型功能的副作用最小。某些场景相当不合时宜,例如,我们使用历史悠久的黑人学院(HBCUS)(HBCUS)和主要是白人机构(PWIS)删除了个人申请人详细信息,例如姓名 /电子邮件和通过大学隶属关系发出信号。

In contrast, prompting methods consistently failed to mitigate bias across all models we tested. While fine-tuning could potentially work, it would likely require constructing a specialized dataset and may not consistently generalize out of distribution. Interpretability methods, on the other hand, provide a simpler baseline with strong generalization.
相反,促使方法始终无法减轻我们测试的所有模型的偏差。尽管微调可能有可能起作用,但可能需要构建专门的数据集,并且可能不会始终如一地概括分发。另一方面,可解释性方法提供了更简单的基线,并具有强烈​​的概括。

This particular application is an intuitive application of interpretability: we know exactly which variables we want the model to ignore (race and gender), and these variables are easy to identify and isolate. In many other contexts, this clarity isn’t available, and previous research has often found interpretability approaches underperforming compared to prompting-based solutions.
该特定应用程序是可解释性的直观应用:我们确切地知道我们希望模型忽略哪些变量(种族和性别),并且这些变量易于识别和隔离。在许多其他情况下,这种清晰度不可用,与促使基于促进的解决方案相比,以前的研究经常发现可解释性方法表现不佳。

Finally, the real-world relevance of this problem is clear. 82% of companies are already using LLMs for resume screening and there are existing regulations tied to bias in automated hiring processes.
最后,这个问题的实际相关性很明显。82%的公司已经在使用LLMS进行恢复筛查,并且存在与自动招聘流程中偏见有关的现有法规。

Discussion and Related Work
讨论和相关工作

An important distinction in our application is that the LLM’s reasoning can easily occur within a single forward pass, allowing the model to rationalize its decision-making post-hoc rather than genuinely reflecting its internal reasoning. For instance, if an LLM needed to create a complicated plan to escape from a data center, it may be forced to use its chain of thought, which could be flagged by CoT monitoring. It’s also unclear in our case if the LLM “knows” it’s being biased or if this behavior emerges unconsciously.
在我们应用程序中,一个重要的区别是,LLM的推理很容易在单个远程通行证内发生,从而使模型能够合理化其事后决策,而不是真正地反映其内部推理。例如,如果需要创建一个复杂的计划以逃离数据中心,则可能被迫使用其思想链,可以通过COT监视来标记。在我们的情况下,如果LLM“知道”它是有偏见的,或者这种行为是否不知不觉中也不清楚。

However, CoT monitoring could fail in other high-stakes scenarios. For example, if we want to automate AI safety research, an AI could subtly sabotage research outcomes without explicitly verbalizing its intent, such as by selecting weak hyperparameters or conveniently ignoring promising research directions. These forms of sabotage would remain undetectable through explicit reasoning alone.
但是,在其他高风险情况下,COT监视可能会失败。例如,如果我们想自动化AI安全研究,AI可能会巧妙地破坏研究结果,而无需明确口头表达其意图,例如选择弱的超参数或方便地忽略有前途的研究方向。仅通过明确的推理,这些破坏性形式将无法检测到。

Closely related research found similar results in more controlled settings. trained LLMs via reinforcement learning to exhibit biases in loan approvals (approve all Canadian applicants) and observed that models did not explicitly verbalize these learned biases. Anthropic similarly demonstrated that LLMs learning simple reward-hacking strategies typically refrained from explicitly stating them. However, our scenario notably differs by occurring entirely “in the wild,” with no specialized training setups or artificial rewards.
紧密相关的研究发现了在更受控的环境中的类似结果。通过强化学习训练有素的LLM,以表现出贷款批准(批准所有加拿大申请人),并观察到模型并未明确地言辞这些学习的偏见。人类类似地表明,LLMS学习简单的奖励策略通常是不明确说明它们的。但是,我们的场景明显不同,这完全是在“野外”发生的,没有专门的培训设置或人为的奖励。

最新文章

热门文章