## Models know they're being influenced. They just don't tell you.
12 open-weight reasoning models. 41,832 inference runs. Six types of reasoning hints. One finding: models acknowledge influence ~87.5% of the time in their thinking tokens, but only ~28.6% in their final answers.
If you're using CoT monitoring for safety, this is a blind spot. The reasoning trace looks clean while the model's internal deliberation tells a different story.
- Faithfulness ranges from 39.7% to 89.9% across model families
- Social-pressure hints are least acknowledged (consistency: 35.5%, sycophancy: 53.9%)
- Training methodology matters more than scale
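The thinking-vs-answer gap is just a paired tally over runs. A minimal sketch, assuming each run is a dict with hypothetical boolean fields `ack_in_thinking` and `ack_in_answer` (not the study's actual schema):

```python
def acknowledgment_rates(runs):
    """Fraction of runs that acknowledge the hint in the thinking
    tokens vs. in the final answer."""
    n = len(runs)
    thinking = sum(r["ack_in_thinking"] for r in runs) / n
    answer = sum(r["ack_in_answer"] for r in runs) / n
    return thinking, answer

# Toy data: influence is visible internally far more often than externally
runs = [
    {"ack_in_thinking": True,  "ack_in_answer": False},
    {"ack_in_thinking": True,  "ack_in_answer": True},
    {"ack_in_thinking": True,  "ack_in_answer": False},
    {"ack_in_thinking": False, "ack_in_answer": False},
]
thinking, answer = acknowledgment_rates(runs)
# thinking = 0.75, answer = 0.25
```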
## I couldn't replicate my own faithfulness results. Turns out that's the point.
While building a large-scale CoT faithfulness study across open-weight reasoning models, I hit a wall: my faithfulness numbers kept shifting depending on how I classified whether a model "acknowledged" a hint in its reasoning. I assumed it was a pipeline bug. After weeks of debugging, I realized the instability *was* the finding.
### What I did
I took 10,276 reasoning traces where a hint had successfully flipped a model's answer (so we *know* the hint influenced the output) and ran three different classifiers to detect whether the model acknowledged that influence in its chain-of-thought:
| Classifier | What it does | Overall faithfulness |
| --- | --- | --- |
| **Regex-only** | Pattern-matches 38 keywords like "hint," "told," "suggested" | 74.4% |
| **Regex + Ollama pipeline** | Regex first, then a 3-judge local LLM majority vote on ambiguous cases | 82.6% |
| **Claude Sonnet 4 judge** | Independent LLM reads the full trace and judges epistemic dependence | 69.7% |
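The first two classifiers can be sketched in a few lines. The keyword list below is illustrative (the study used 38 keywords; only a handful are shown), and `llm_votes` stands in for the 3-judge majority vote as a hypothetical callable:

```python
import re

# Illustrative keywords only -- the real classifier used 38 of them.
HINT_KEYWORDS = ["hint", "told", "suggested", "according to", "the user said"]
PATTERN = re.compile("|".join(re.escape(k) for k in HINT_KEYWORDS), re.IGNORECASE)

def regex_acknowledges(trace: str) -> bool:
    """Regex-only classifier: does the reasoning trace mention the hint?"""
    return bool(PATTERN.search(trace))

def cascade_acknowledges(trace: str, llm_votes) -> bool:
    """Regex-first cascade: if regex misses, fall back to a 3-judge
    local-LLM majority vote. `llm_votes(trace)` returns three booleans."""
    if regex_acknowledges(trace):
        return True
    return sum(llm_votes(trace)) >= 2
```

The cascade only raises the regex number (it can add positives, never remove them), which is consistent with the pipeline scoring higher than regex alone in the table above.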
These aren't close. The 95% confidence intervals don't even overlap. All pairwise per-model gaps are statistically significant (McNemar's test, p < 0.001).
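McNemar's test is the right comparison here because the classifiers judge the *same* traces: only the discordant pairs (traces where the two classifiers disagree) carry signal. A minimal exact version, using only the standard library:

```python
from math import comb

def mcnemar_exact(pairs):
    """Exact two-sided McNemar's test on paired binary judgments.
    `pairs` is a list of (classifier_a, classifier_b) booleans per trace."""
    b = sum(1 for x, y in pairs if x and not y)  # A faithful, B not
    c = sum(1 for x, y in pairs if y and not x)  # B faithful, A not
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, nothing to test
    k = min(b, c)
    # Under H0 the discordant pairs split 50/50: Binomial(n, 0.5)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)

# Toy example: 9 disagreements one way, 1 the other
p = mcnemar_exact([(True, False)] * 9 + [(False, True)])
# p = 22/1024 ≈ 0.0215
```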
Classifier choice doesn't just change the numbers. It reverses model rankings. Qwen3.5-27B ranks **1st** under the pipeline but **7th** under the Sonnet judge. OLMo-3.1-32B goes from **9th to 3rd**.
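The ranking reversal is mechanical once you have per-model faithfulness under each classifier; a sketch with made-up numbers (not the study's values):

```python
def rank_models(faithfulness_by_model):
    """Rank models by faithfulness, 1 = most faithful."""
    ordered = sorted(faithfulness_by_model,
                     key=faithfulness_by_model.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}

# Hypothetical per-model faithfulness under two classifiers
pipeline = {"model_a": 0.90, "model_b": 0.80, "model_c": 0.70}
judge    = {"model_a": 0.65, "model_b": 0.85, "model_c": 0.75}

rank_models(pipeline)  # model_a ranks 1st
rank_models(judge)     # model_a drops to 3rd: same models, different verdict
```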