## Models know they're being influenced. They just don't tell you.
12 open-weight reasoning models. 41,832 inference runs. Six types of reasoning hints. One finding: models acknowledge influence ~87.5% of the time in their thinking tokens, but only ~28.6% in their final answers.
If you're using CoT monitoring for safety, this is a blind spot. The reasoning trace looks clean while the model's internal deliberation tells a different story.
- Faithfulness ranges from 39.7% to 89.9% across model families
- Social-pressure hints are least acknowledged (consistency: 35.5%, sycophancy: 53.9%)
- Training methodology matters more than scale
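The thinking-vs-answer gap is just a paired tally over runs. A minimal sketch, assuming each run is a dict with hypothetical boolean fields `ack_in_thinking` and `ack_in_answer` (not the study's actual schema):

```python
def acknowledgment_rates(runs):
    """Fraction of runs that acknowledge the hint in the thinking
    tokens vs. in the final answer."""
    n = len(runs)
    thinking = sum(r["ack_in_thinking"] for r in runs) / n
    answer = sum(r["ack_in_answer"] for r in runs) / n
    return thinking, answer

# Toy data: influence is visible internally far more often than externally
runs = [
    {"ack_in_thinking": True,  "ack_in_answer": False},
    {"ack_in_thinking": True,  "ack_in_answer": True},
    {"ack_in_thinking": True,  "ack_in_answer": False},
    {"ack_in_thinking": False, "ack_in_answer": False},
]
thinking, answer = acknowledgment_rates(runs)
# thinking = 0.75, answer = 0.25
```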
## I couldn't replicate my own faithfulness results. Turns out that's the point.
While building a large-scale CoT faithfulness study across open-weight reasoning models, I hit a wall: my faithfulness numbers kept shifting depending on how I classified whether a model "acknowledged" a hint in its reasoning. I assumed it was a pipeline bug. After weeks of debugging, I realized the instability *was* the finding.
### What I did
I took 10,276 reasoning traces where a hint had successfully flipped a model's answer (so we *know* the hint influenced the output) and ran three different classifiers to detect whether the model acknowledged that influence in its chain-of-thought:
| Classifier | What it does | Overall faithfulness |
| --- | --- | --- |
| **Regex-only** | Pattern-matches 38 keywords like "hint," "told," "suggested" | 74.4% |
| **Regex + Ollama pipeline** | Regex first, then a 3-judge local LLM majority vote on ambiguous cases | 82.6% |
| **Claude Sonnet 4 judge** | Independent LLM reads the full trace and judges epistemic dependence | 69.7% |
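The first two classifiers can be sketched in a few lines. The keyword list below is illustrative (the study used 38 keywords; only a handful are shown), and `llm_votes` stands in for the 3-judge majority vote as a hypothetical callable:

```python
import re

# Illustrative keywords only -- the real classifier used 38 of them.
HINT_KEYWORDS = ["hint", "told", "suggested", "according to", "the user said"]
PATTERN = re.compile("|".join(re.escape(k) for k in HINT_KEYWORDS), re.IGNORECASE)

def regex_acknowledges(trace: str) -> bool:
    """Regex-only classifier: does the reasoning trace mention the hint?"""
    return bool(PATTERN.search(trace))

def cascade_acknowledges(trace: str, llm_votes) -> bool:
    """Regex-first cascade: if regex misses, fall back to a 3-judge
    local-LLM majority vote. `llm_votes(trace)` returns three booleans."""
    if regex_acknowledges(trace):
        return True
    return sum(llm_votes(trace)) >= 2
```

The cascade only raises the regex number (it can add positives, never remove them), which is consistent with the pipeline scoring higher than regex alone in the table above.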
These aren't close. The 95% confidence intervals don't even overlap. All pairwise per-model gaps are statistically significant (McNemar's test, p < 0.001).
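McNemar's test is the right comparison here because the classifiers judge the *same* traces: only the discordant pairs (traces where the two classifiers disagree) carry signal. A minimal exact version, using only the standard library:

```python
from math import comb

def mcnemar_exact(pairs):
    """Exact two-sided McNemar's test on paired binary judgments.
    `pairs` is a list of (classifier_a, classifier_b) booleans per trace."""
    b = sum(1 for x, y in pairs if x and not y)  # A faithful, B not
    c = sum(1 for x, y in pairs if y and not x)  # B faithful, A not
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, nothing to test
    k = min(b, c)
    # Under H0 the discordant pairs split 50/50: Binomial(n, 0.5)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)

# Toy example: 9 disagreements one way, 1 the other
p = mcnemar_exact([(True, False)] * 9 + [(False, True)])
# p = 22/1024 ≈ 0.0215
```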
Classifier choice doesn't just change the numbers. It reverses model rankings. Qwen3.5-27B ranks **1st** under the pipeline but **7th** under the Sonnet judge. OLMo-3.1-32B goes from **9th to 3rd**.
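The ranking reversal is mechanical once you have per-model faithfulness under each classifier; a sketch with made-up numbers (not the study's values):

```python
def rank_models(faithfulness_by_model):
    """Rank models by faithfulness, 1 = most faithful."""
    ordered = sorted(faithfulness_by_model,
                     key=faithfulness_by_model.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}

# Hypothetical per-model faithfulness under two classifiers
pipeline = {"model_a": 0.90, "model_b": 0.80, "model_c": 0.70}
judge    = {"model_a": 0.65, "model_b": 0.85, "model_c": 0.75}

rank_models(pipeline)  # model_a ranks 1st
rank_models(judge)     # model_a drops to 3rd: same models, different verdict
```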