model-internal

zhengxuanzenwu authored 7 papers almost 2 years ago

Rigorously Assessing Natural Language Explanations of Neurons

Paper • 2309.10312 • Published Sep 19, 2023

MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions

Paper • 2305.14795 • Published May 24, 2023

A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments

Paper • 2401.12631 • Published Jan 23, 2024

ReFT: Representation Finetuning for Language Models

Paper • 2404.03592 • Published Apr 4, 2024

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

Paper • 2403.07809 • Published Mar 12, 2024

DynaSent: A Dynamic Benchmark for Sentiment Analysis

Paper • 2012.15349 • Published Dec 30, 2020

CEBaB: Estimating the Causal Effects of Real-World Concepts on NLP Model Behavior

Paper • 2205.14140 • Published May 27, 2022

zhengxuanzenwu authored 2 papers over 2 years ago

Causal Proxy Models for Concept-Based Model Explanations

Paper • 2209.14279 • Published Sep 28, 2022

Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

Paper • 2305.08809 • Published May 15, 2023