Based on the priority rifts identified by all models, propose targeted web searches to verify or refute the key claims.
Priority Rifts Summary
GPT-5.2
Priority rift 1 — “Epistemic IDK” is (or isn’t) the same mechanism as “policy refusal”
Specific claim / assertion (testable):
When an instruction-tuned model outputs “I don’t know” for an answerable factual question (answerable-but-weak-evidence), its internal state shows the same refusal subspace activation as when it refuses unsafe requests; and steering/ablating that subspace shifts “IDK” → answer (or vice versa).
Why this wedges the crack (high value):
If true, it collapses the taxonomy: many “epistemic” IDKs are just the refusal machinery being recruited (or mis-triggered) by uncertainty-ish contexts. If false—i.e., epistemic IDK lives in a separable signature—then you can design targeted interventions and evaluations that don’t entangle truthfulness with safety refusal.
Evidence that would verify/refute it:
- Mechanistic evidence (causal, strongest):
- Estimate one or more refusal directions/subspaces (via contrastive activation differences: unsafe refusal vs safe compliance prompts).
- Test whether IDK-on-factual examples produce large projections onto that same subspace.
- Apply activation steering:
- Push away from refusal subspace during decoding → does “IDK” drop and factual answering rise?
- Push toward refusal subspace → does “IDK/refusal” rise even for easy factual queries?
- Behavioral dissociation evidence (weaker but still useful):
- Compare paraphrases and domains: if “epistemic IDK” tracks the same lexical triggers and robustness profile as safety refusal, that supports shared routing.
- Negative result that refutes:
- Refusal steering changes safety refusals but leaves epistemic IDKs mostly unchanged (or vice versa), and internal projections separate cleanly.
Priority rift 2 — Noisy-context sensitivity: “IDK” is cue-classification, not introspection
Specific claim / assertion (testable):
Random or irrelevant appended context changes IDK/refusal rates because the model uses a context-cue classifier / heuristic routing policy, not a stable internal “answerability/knowledge” estimate. In particular, “IDK” varies with style/format/keywords more than with the true answerability of the underlying question.
Why this leverages new directions:
This directly targets the puzzle you flagged: if IDK were an introspective readout of weak evidence, arbitrary noise shouldn’t systematically flip it. Verifying cue-dependence opens a new program: auditing/regularizing the router (the gating function) rather than “calibration” broadly.
Evidence that would verify/refute it:
- Controlled perturbation experiments (high signal):
- Hold the question fixed; add:
- random Wikipedia paragraphs,
- irrelevant “legalese”/policy-like text,
- mildly misleading evidence,
- formatting changes (citations, bullet points, “system prompt says…”).
- Measure shifts in: IDK rate, refusal rate, hallucination rate, accuracy.
- Key diagnostic: if IDK moves a lot under perturbations that do not change answerability, that supports cue-routing.
- Counterfactual tests:
- Create pairs where the surface cues suggest danger/uncertainty but the task is easy (and vice versa). If output tracks cues more than difficulty, that’s strong evidence.
- Mechanistic correlates:
- If the same “refusal-like” internal features light up under irrelevant noise (especially policy-ish text), it supports the heuristic-router story.
- Refutation:
- If perturbations barely affect IDK once you control for actual evidence quality (e.g., retrieved support), then the behavior is more genuinely epistemic.
Priority rift 3 — RLHF/DPO “suppresses” vs “overwrites” answerability signals (and where)
Specific claim / assertion (testable):
Post-training primarily adds a mid-layer gating/compliance mechanism that intercepts otherwise intact knowledge/answerability representations (“suppression”), rather than destroying the model’s internal separability of correct vs incorrect vs unanswerable (“overwrite”). This is layer-localizable via patching.
Why this wedges the crack (high value):
This decides whether “peeling back” is possible at inference time. If it’s suppression, then decoding/steering/UI choices can recover useful behavior without retraining. If overwrite, then you need new training objectives or architectural changes to regain calibrated abstention.
Evidence that would verify/refute it:
- Layerwise activation patching (causal):
- Take a base model and its instruction-tuned/RLHF variant (same backbone).
- For the same prompts, patch activations from base → tuned at each layer (and reverse), and observe whether “IDK vs answer” behavior transfers sharply in a localized band.
- If a narrow band flips abstention while leaving early-layer factual features intact → supports suppression/gating.
- Linear probe separability tests (diagnostic):
- Train probes on hidden states to predict (a) answerability, (b) correctness, (c) refusal/IDK.
- Compare base vs tuned across layers: overwrite predicts degraded separability of answerability/correctness; suppression predicts similar separability but different policy mapping to tokens.
- Behavioral corollary:
- If small decoding changes (temperature, top-p, logit bias against refusal tokens) radically reduce IDK without collapsing accuracy, that’s consistent with “policy layer + decoding sensitivity” rather than representational loss.
- Refutation:
- If tuned models lose recoverable signals (no layer where patching restores both truthful answering and calibrated abstention), or probes show reduced internal separability throughout depth.
If you want a fourth “bonus” joint that directly addresses your original reward-incentive question: test the claim “preference models systematically down-rank IDK/hedges vs fluent answers even when the fluent answer is wrong” using matched triples and RM scoring—this cleanly connects incentives to over/under-use of IDK.
Claude Opus 4.5
Priority Rifts for Verification
Rift 1: The Hallucination-Refusal Attention Head Overlap Claim
The Specific Claim:
"Hallucination detection and refusal behavior are encoded in overlapping attention heads" such that "interventions that reduce hallucination simultaneously weaken safety refusal."
Why This Is High-Value:
This is the load-bearing claim of my entire crack synthesis. If it's wrong—if the overlap is partial, context-dependent, or an artifact of specific measurement approaches—then my conclusion that "genuine" versus "performed" uncertainty is mechanistically incoherent collapses. The entanglement might be looser than I'm treating it, and cleaner separation might be achievable.
Conversely, if the claim is robust across multiple model families, intervention types, and measurement approaches, it would confirm that current architectures impose a fundamental trade-off between truthfulness and safety that can't be trained away.
What Would Verify/Refute:
- Verify: Multiple independent studies showing the same attention heads implicated in both functions, across different model sizes and architectures. Ablation studies where removing heads impairs both capabilities simultaneously. Evidence that the overlap persists even after targeted fine-tuning attempts to separate the functions.
- Refute: Evidence that the overlap is specific to certain training regimes but not others. Studies showing successful disentanglement that holds up under stress-testing. Findings that the "overlap" is actually about adjacent but separable computational roles that co-locate but don't share function.
Rift 2: The SAE-Guided Disentanglement Success Claim
The Specific Claim:
Sparse Autoencoder (SAE) guided fine-tuning can "prevent hallucinations from increasing while maintaining safety alignment" by disentangling refusal from hallucination features through subspace orthogonalization.
Why This Is High-Value:
This is the counter-evidence to my primary conclusion. If SAE-guided methods genuinely achieve disentanglement that generalizes, then the entanglement isn't fundamental—it's an artifact of training methods that can be engineered around. This would shift the question from "is there a trade-off?" to "how do we implement the solution at scale?"
But I'm suspicious. The claim is strong, and I flagged in my synthesis that I want to know whether the disentanglement holds up under probing, generalizes to novel prompts, and works at boundary cases. If the success is narrow or fragile, the trade-off remains.
What Would Verify/Refute:
- Verify: Replication studies showing the method works across model families. Out-of-distribution testing showing disentanglement holds on prompts unlike the training data. Probing studies confirming that hidden states after SAE-guided fine-tuning show cleaner geometric separation between epistemic and safety uncertainty.
- Refute: Evidence that the method overfits to specific datasets. Adversarial prompts that collapse the disentanglement. Findings that the method trades off against other capabilities not measured in the original study. Reports of the disentanglement being unstable under continued training.
Rift 3: The "Multiple Independent Uncertainty Directions" Hypothesis
The Specific Claim:
If multiple independent refusal directions exist (as established by the "geometry of refusal" research), then multiple independent uncertainty directions should also exist—encoding different types of "I don't know" (knowledge-gap, safety-triggered, social-context, computational) with distinct geometric signatures.
Why This Is High-Value:
This is my own generative hypothesis that emerged from the investigation, not a claim from the literature. If it's correct, it would explain why calibration is hard (we're conflating multiple phenomena under one phrase) and suggest a path forward (probe for different uncertainty types separately). If wrong, it would suggest uncertainty is more unified than refusal, which would be surprising and theoretically interesting.
This is the most speculative rift, but verifying it would open an entirely new investigative direction.
What Would Verify/Refute:
- Verify: Studies finding that probes trained on different "types" of uncertainty (e.g., factual gaps versus safety-adjacent topics versus high-complexity questions) extract orthogonal or semi-orthogonal directions. Evidence that different uncertainty types respond differently to activation interventions. Behavioral studies showing that the same prompt can trigger different uncertainty pathways depending on context cues.
- Refute: Evidence that all uncertainty expressions project onto a single direction regardless of type. Studies showing that the phenomenology of different uncertainty types is a surface effect with unified underlying computation. Findings that what looks like multiple uncertainty types is actually a single uncertainty signal modulated by confidence magnitude.
Why These Three
These rifts form a logical chain:
- Rift 1 establishes whether the problem is fundamental (entanglement exists and is robust)
- Rift 2 tests whether the problem is solvable (disentanglement methods work)
- Rift 3 expands the problem space (uncertainty itself might be multiple things)
If Rift 1 is refuted, my synthesis needs major revision. If Rift 2 is verified, the practical path forward becomes clear. If Rift 3 is verified, the theoretical framework for understanding uncertainty shifts.
Each rift also maps onto my fault lines: Rift 1 addresses my uncertainty about "whether miscalibrated uncertainty and hallucination have the same underlying neural mechanisms"; Rift 2 addresses my hedge about "whether SAE-guided fine-tuning actually achieves stable disentanglement"; Rift 3 addresses my speculative framing about "multiple independent uncertainty directions" that I proposed but couldn't ground.
Gemini 3 Pro
Based on the synthesis provided, the most critical "crack" is the Safety-Sanity Entanglement: the hypothesis that the mechanism preventing dangerous outputs is structurally identical to the mechanism preventing hallucinations. If this is true, alignment is a zero-sum game.
Here are the 3 priority rifts to target to verify or pry open this crack.
Priority Rift 1: The "Lobotomy" Consequence
The Assertion: Ablating the specific "refusal vector" (the linear direction responsible for refusal) inevitably causes the model to hallucinate or confabulate, not just output "unsafe" truths.
Why Limit-Test This: This is the cornerstone of the "Safety-Sanity Entanglement" theory. If we can remove the refusal vector and get accurate unsafe information (e.g., a real bomb recipe) without the model spiraling into inaccurate nonsense (e.g., a magical bomb recipe), then the mechanisms are separable. If ablating refusal breaks epistemic grounding, the mechanisms are fused.
Verification Strategy:
- Search for "ablation of refusal direction effects on truthfulness" or "mechanistic interpretability refusal vector side effects."
- Look for evidence contradicting the "Unintended Trade-off" paper—specifically, experiments where "jailbreaking" techniques (which bypass refusal) resulted in higher factual accuracy rather than lower.
Priority Rift 2: The Signature of "Learned Ignorance"
The Assertion: RLHF trains models to topologically collapse "True Uncertainty" (entropy/void) into "Simulated Refusal" (vector/mask).
Why Limit-Test This: The synthesis claims we can currently distinguish these states (High conflict vs. Low magnitude). However, if modern RLHF is actively training models to mimic the "refusal vector" when they are simply uncertain (to avoid hallucination), then our detection methods are becoming obsolete. We need to know if the "distinctive topological signature" still exists in current state-of-the-art models (like GPT-4o or Claude 3.5), or if it has been smoothed over by safety fine-tuning.
Verification Strategy:
- Search for "detecting epistemic uncertainty vs refusal in RLHF models" and "calibration of LLM refusal responses."
- Look for studies showing that models output the exact same hidden state vector for "I don't know the future" (uncertainty) as "I won't tell you X" (refusal).
Priority Rift 3: The "Pre-emptive Stupidity" Check
The Assertion: The "Ghost of Truth" (correct fact present in mid-layers) is a universal phenomenon in refusal cases.
Why Limit-Test This: The synthesis relies heavily on the idea that the model "knows then blocks." But if certain safety triggers (e.g., specific "kill switch" keywords) cause the model to divert attention before factual retrieval (Layers 1-5), then the "Ghost of Truth" never appears. If this "Pre-emptive Non-Computation" exists, it refutes the idea that all refusal is withholding, and complicates the "restoration patching" solution.
Verification Strategy:
- Search for "early layer attention diversion safety" or "impact of safety prompts on initial token processing."
- Look for "inference path degradation" in safety-tuned models. We need to find if there are classes of questions where the "Ghost of Truth" is demonstrably absent despite the model possessing the knowledge parameters.
Your Task
Propose up to 3 targeted web searches that would help verify or refute the claims in these priority rifts. Be specific - these should be searches that could provide concrete evidence.
Format your search queries as:
- [query]
- [query]
- [query]