About me
I am a researcher on the interpretability team at Anthropic, where I try to reverse-engineer how language models compose microscopic building blocks into higher-level computational circuits.
I completed my PhD at MIT, where I worked with Dimitris Bertsimas, Max Tegmark, and Neel Nanda on interpretability. Before that, I was a software engineer at Google working on data pipelines for the storage analytics team, and I researched fairness-optimized political redistricting with David Shmoys.
Selected Publications
See Google Scholar for a full and up-to-date list.
- Refusal in Language Models is Mediated by a Single Direction
by Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
Appeared at NeurIPS 2024 [arXiv]
- Not All Language Model Features are Linear
by Joshua Engels, Eric J Michaud, Isaac Liao, Wes Gurnee, Max Tegmark
Appearing at ICLR 2025 [arXiv]
- Confidence Regulation Neurons in Language Models
by Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, Neel Nanda
Appeared at NeurIPS 2024 [arXiv]
- Universal Neurons in GPT2 Language Models
by Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas
Published in TMLR [arXiv] [Twitter]
- Language Models Represent Space and Time
by Wes Gurnee, Max Tegmark
Appeared at ICLR 2024 [arXiv] [Twitter]
- Finding Neurons in a Haystack: Case Studies with Sparse Probing
by Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dimitrii Troitskii, Dimitris Bertsimas
Published in TMLR [Paper] [arXiv] [Twitter]
- Learning Sparse Nonlinear Dynamics via Mixed-Integer Optimization
by Wes Gurnee, Dimitris Bertsimas
Published in Nonlinear Dynamics [Paper] [arXiv] [Twitter]
- Combatting gerrymandering with social choice: The design of multi-member districts
by Nikhil Garg, Wes Gurnee, David Rothschild, David Shmoys
Published in EC ’22 [Paper] [arXiv] [Talk]
- Fairmandering: A column generation heuristic for fairness-optimized political districting
by Wes Gurnee, David Shmoys
Best paper award at SIAM ACDA ’21 [Paper] [arXiv] [Talk]
Other Projects and Writing
- SAE reconstruction errors are (empirically) pathological (2024) - A preliminary research post on a potential issue with sparse autoencoder reconstructions.
- Inductive Biases of SGD Training (2022) - A review of the inductive biases of stochastic gradient descent (SGD) when training deep neural networks.
- Analytics for Health Security (2022) - An analytics-enabled defense-in-depth strategy for health security.
- Optimal Political Districting: The Anchor Method (2022) - A formulation of optimal political districting using the anchor method.
- Fairmandering: Generating Fairness-optimized Political Districts (SIAM News; 2021)
- Scalable Approximation of k-medians for Political Districting (2020) - Using a linear programming relaxation to approximate the k-medians problem for political districting.