Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models

Hase, Peter; Bansal, Mohit; Kim, Been; Ghandeharioun, Asma

arXiv:2301.04213 (cs)

[Submitted on 10 Jan 2023 (v1), last revised 16 Oct 2023 (this version, v2)]

Abstract: Language models learn a great quantity of factual information during pretraining, and recent work localizes this information to specific model weights like mid-layer MLP weights. In this paper, we find that we can change how a fact is stored in a model by editing weights that are in a different location than where existing methods suggest that the fact is stored. This is surprising because we would expect that localizing facts to specific model parameters would tell us where to manipulate knowledge in models, and this assumption has motivated past work on model editing methods. Specifically, we show that localization conclusions from representation denoising (also known as Causal Tracing) do not provide any insight into which model MLP layer would be best to edit in order to override an existing stored fact with a new one. This finding raises questions about how past work relies on Causal Tracing to select which model layers to edit. Next, we consider several variants of the editing problem, including erasing and amplifying facts. For one of our editing problems, editing performance does relate to localization results from representation denoising, but we find that which layer we edit is a far better predictor of performance. Our results suggest, counterintuitively, that better mechanistic understanding of how pretrained language models work may not always translate to insights about how to best change their behavior. Our code is available at this https URL

Comments: NeurIPS 2023 (Spotlight). 26 pages, 22 figures
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as: arXiv:2301.04213 [cs.LG] (or arXiv:2301.04213v2 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2301.04213

Submission history

From: Peter Hase
[v1] Tue, 10 Jan 2023 21:26:08 UTC (2,424 KB)
[v2] Mon, 16 Oct 2023 17:42:58 UTC (1,791 KB)
