About me

I am a researcher on the interpretability team at Anthropic, where I try to reverse engineer how language models compose microscopic building blocks into higher-level computational circuits.

I completed my PhD at MIT, where I worked with Dimitris Bertsimas, Max Tegmark, and Neel Nanda on interpretability. Before that, I was a software engineer at Google working on data pipelines for the storage analytics team, and I researched fairness-optimized political redistricting with David Shmoys.

Selected Publications

See Google Scholar for a full and up-to-date list.

When Models Manipulate Manifolds: The Geometry of a Counting Task
by Wes Gurnee, Emmanuel Ameisen, Isaac Kauvar, Julius Tarng, Adam Pearce, Chris Olah, and Joshua Batson
Transformer Circuits Thread [Paper] [Twitter]
On the Biology of a Large Language Model
by Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, (many others), and Joshua Batson
Transformer Circuits Thread [Paper] [Blog]
Circuit Tracing: Revealing Computational Graphs in Language Models
by Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, (many others), and Joshua Batson
Transformer Circuits Thread [Paper] [Blog]
Refusal in Language Models is Mediated by a Single Direction
by Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
Appeared at NeurIPS 2024 [arXiv]
Not All Language Model Features are Linear
by Joshua Engels, Eric J Michaud, Isaac Liao, Wes Gurnee, Max Tegmark
Appearing at ICLR 2025 [arXiv]
Confidence Regulation Neurons in Language Models
by Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda
Appeared at NeurIPS 2024 [arXiv]
Universal Neurons in GPT2 Language Models
by Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas
Published in TMLR [arXiv] [Twitter]
Language Models Represent Space and Time
by Wes Gurnee and Max Tegmark
Appeared at ICLR 2024 [arXiv] [Twitter]
Finding Neurons in a Haystack: Case Studies with Sparse Probing
by Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dimitrii Troitskii, and Dimitris Bertsimas
Published in TMLR [Paper] [arXiv] [Twitter]
Learning Sparse Nonlinear Dynamics via Mixed-Integer Optimization
by Wes Gurnee and Dimitris Bertsimas
Published in Nonlinear Dynamics [Paper] [arXiv] [Twitter]
Combatting gerrymandering with social choice: The design of multi-member districts
by Nikhil Garg, Wes Gurnee, David Rothschild, David Shmoys
Published in EC ’22 [Paper] [arXiv] [Talk]
Fairmandering: A column generation heuristic for fairness-optimized political districting
by Wes Gurnee and David Shmoys
Best paper award at SIAM ACDA ’21 [Paper] [arXiv] [Talk]

Other Projects and Writing

SAE reconstruction errors are (empirically) pathological (2024) - A preliminary research post on a potential issue with sparse autoencoder reconstructions.
Inductive Biases of SGD Training (2022) - A review of the inductive biases of stochastic gradient descent (SGD) when training deep neural networks.
Analytics for Health Security (2022) - An analytics-enabled defense-in-depth strategy for health security.
Optimal Political Districting: The Anchor Method (2022) - A formulation of optimal political districting using the anchor method.
Fairmandering: Generating Fairness-optimized Political Districts (SIAM News; 2021)
Scalable Approximation of k-medians for Political Districting (2020) - Using a linear programming relaxation to approximate the k-medians problem for political districting.

Wes Gurnee