Publications | NLP@DSAI

2025

Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion

Denitsa Saynova, Lovisa Hagström, Moa Johansson, Richard Johansson, and Marco Kuhlmann

In Findings of the Association for Computational Linguistics: ACL 2025, Jul 2025

Language models (LMs) can make a correct prediction based on many possible signals in a prompt, not all corresponding to recall of factual associations. However, current interpretations of LMs fail to take this into account. For example, given the query "Astrid Lindgren was born in" with the corresponding completion "Sweden", no difference is made between whether the prediction was based on knowing where the author was born or assuming that a person with a Swedish-sounding name was born in Sweden. In this paper, we present a model-specific recipe - PrISM - for constructing datasets with examples of four different prediction scenarios: generic language modeling, guesswork, heuristics recall and exact fact recall. We apply two popular interpretability methods to the scenarios: causal tracing (CT) and information flow analysis. We find that both yield distinct results for each scenario. Results for exact fact recall and generic language modeling scenarios confirm previous conclusions about the importance of mid-range MLP sublayers for fact recall, while results for guesswork and heuristics indicate a critical role of late last token position MLP sublayers. In summary, we contribute resources for a more extensive and granular study of fact completion in LMs, together with analyses that provide a more nuanced understanding of how LMs process fact-related queries.
Benchmarking Debiasing Methods for LLM-based Parameter Estimates

Nicolas Audinet de Pieuchon, Adel Daoud, Connor T. Jerzak, Moa Johansson, and Richard Johansson

arXiv preprint arXiv:2506.09627, Jul 2025

Large language models (LLMs) offer an inexpensive yet powerful way to annotate text, but are often inconsistent when compared with experts. These errors can bias downstream estimates of population parameters such as regression coefficients and causal effects. To mitigate this bias, researchers have developed debiasing methods such as Design-based Supervised Learning (DSL) and Prediction-Powered Inference (PPI), which promise valid estimation by combining LLM annotations with a limited number of expensive expert annotations. Although these methods produce consistent estimates under theoretical assumptions, it is unknown how they compare in finite samples of sizes encountered in applied research. We make two contributions: First, we study how each method’s performance scales with the number of expert annotations, highlighting regimes where LLM bias or limited expert labels significantly affect results. Second, we compare DSL and PPI across a range of tasks, finding that although both achieve low bias with large datasets, DSL often outperforms PPI on bias reduction and empirical efficiency, but its performance is less consistent across datasets. Our findings indicate that there is a bias-variance tradeoff at the level of debiasing methods, calling for more research on developing metrics for quantifying their efficiency in finite samples.
CUB: Benchmarking Context Utilisation Techniques for Language Models

Lovisa Hagström, Youna Kim, Haeun Yu, Sang-goo Lee, Richard Johansson, Hyunsoo Cho, and Isabelle Augenstein

arXiv preprint arXiv:2505.16518, Jul 2025

Incorporating external knowledge is crucial for knowledge-intensive tasks, such as question answering and fact checking. However, language models (LMs) may ignore relevant information that contradicts outdated parametric memory or be distracted by irrelevant contexts. While many context utilisation manipulation techniques (CMTs) that encourage or suppress context utilisation have recently been proposed to alleviate these issues, few have seen systematic comparison. In this paper, we develop CUB (Context Utilisation Benchmark) to help practitioners within retrieval-augmented generation (RAG) identify the best CMT for their needs. CUB allows for rigorous testing on three distinct context types, observed to capture key challenges in realistic context utilisation scenarios. With this benchmark, we evaluate seven state-of-the-art methods, representative of the main categories of CMTs, across three diverse datasets and tasks, applied to nine LMs. Our results show that most of the existing CMTs struggle to handle the full set of types of contexts that may be encountered in real-world retrieval-augmented scenarios. Moreover, we find that many CMTs display an inflated performance on simple synthesised datasets, compared to more realistic datasets with naturally occurring samples. Altogether, our results show the need for holistic tests of CMTs and the development of CMTs that can handle multiple context types.
A Reality Check on Context Utilisation for Retrieval-Augmented Generation

Lovisa Hagström, Sara Vera Marjanović, Haeun Yu, Arnav Arora, Christina Lioma, Maria Maistro, Pepa Atanasova, and Isabelle Augenstein

In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Jul 2025

Retrieval-augmented generation (RAG) helps address the limitations of the parametric knowledge embedded within a language model (LM). However, investigations of how LMs utilise retrieved information of varying complexity in real-world scenarios have been limited to synthetic contexts. We introduce DRUID (Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts) with real-world queries and contexts manually annotated for stance. The dataset is based on the prototypical task of automated claim verification, for which automated retrieval of real-world evidence is crucial. We compare DRUID to synthetic datasets (CounterFact, ConflictQA) and find that artificial datasets often fail to represent the complex and diverse real-world context settings. We show that synthetic datasets exaggerate context characteristics rare in real retrieved data, which leads to inflated context utilisation results, as measured by our novel ACU score. Moreover, while previous work has mainly focused on singleton context characteristics to explain context utilisation, correlations between singleton context properties and ACU on DRUID are surprisingly small compared to other properties related to context source. Overall, our work underscores the need for real-world aligned context utilisation studies to represent and improve performance in real-world RAG settings.
Language Model Re-rankers are Fooled by Lexical Similarities

Lovisa Hagström, Ercong Nie, Ruben Halifa, Helmut Schmid, Richard Johansson, and Alexander Junge

In Proceedings of the Eighth Fact Extraction and VERification Workshop (FEVER), Jul 2025

Language model (LM) re-rankers are used to refine retrieval results for retrieval-augmented generation (RAG). They are more expensive than lexical matching methods like BM25 but assumed to better process semantic information. To understand whether LM re-rankers always live up to this assumption, we evaluate 6 different LM re-rankers on the NQ, LitQA2 and DRUID datasets. Our results show that LM re-rankers struggle to outperform a simple BM25 re-ranker on DRUID. Leveraging a novel separation metric based on BM25 scores, we explain and identify re-ranker errors stemming from lexical dissimilarities. We also investigate different methods to improve LM re-ranker performance and find these methods mainly useful for NQ. Taken together, our work identifies and explains weaknesses of LM re-rankers and points to the need for more adversarial and realistic datasets for their evaluation.
Identifying Non-Replicable Social Science Studies with Language Models

Denitsa Saynova, Kajsa Hansson, Bastiaan Bruinsma, Annika Fredén, and Moa Johansson

arXiv preprint arXiv:2503.10671, Jul 2025

In this study, we investigate whether LLMs can be used to indicate if a study in the behavioural social sciences is replicable. Using a dataset of 14 previously replicated studies (9 successful, 5 unsuccessful), we evaluate the ability of both open-source (Llama 3 8B, Qwen 2 7B, Mistral 7B) and proprietary (GPT-4o) instruction-tuned LLMs to discriminate between replicable and non-replicable findings. We use LLMs to generate synthetic samples of responses from behavioural studies and estimate whether the measured effects support the original findings. When compared with human replication results for these studies, we achieve F1 values of up to 77% with Mistral 7B, 67% with GPT-4o and Llama 3 8B, and 55% with Qwen 2 7B, suggesting their potential for this task. We also analyse how effect size calculations are affected by sampling temperature and find that low variance (due to temperature) leads to biased effect estimates.
From Electrophoresis to Wikidata: Festschrift in honor of Pierre Nugues

Jul 2025

Exhuming a Swedish Temporal Relation Dataset from the Past

Richard Johansson

In From Electrophoresis to Wikidata: Festschrift in honor of Pierre Nugues, Jul 2025

I resurrected a dataset containing annotated event-to-event temporal relations in Swedish traffic accident reports, and then benchmarked how well contemporary large language models classify these relations. Can the current generation of LLMs outperform our results from two decades ago?

2024

Deciphering the Interplay of Parametric and Non-parametric Memory in Retrieval-augmented Language Models

Mehrdad Farahani and Richard Johansson

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024

Generative language models often struggle with specialized or less-discussed knowledge. A potential solution is found in Retrieval-Augmented Generation (RAG) models, which retrieve information before generating responses. In this study, we explore how the Atlas approach, a RAG model, decides between what it already knows (parametric) and what it retrieves (non-parametric). We use causal mediation analysis and controlled experiments to examine how internal representations influence information processing. Our findings disentangle the effects of parametric knowledge and the retrieved context. They indicate that in cases where the model can choose between both types of information (parametric and non-parametric), it relies more on the context than the parametric knowledge. Furthermore, the analysis investigates the computations involved in how the model uses the information from the context. We find that multiple mechanisms are active within the model and can be detected with mediation analysis: first, the decision of whether the context is relevant, and second, how the encoder computes output representations to support copying when relevant.
Can Large Language Models (or Humans) Disentangle Text?

Nicolas Audinet Pieuchon, Adel Daoud, Connor Jerzak, Moa Johansson, and Richard Johansson

In Proceedings of the 6th Workshop on Natural Language Processing and Computational Social Science (NLP+CSS), Nov 2024

We investigate the potential of large language models (LLMs) to disentangle text variables, that is, to remove the textual traces of an undesired forbidden variable, a task sometimes known as text distillation and closely related to the fairness in AI and causal inference literature. We employ a range of LLM approaches in an attempt to disentangle text by identifying and removing information about a target variable while preserving other relevant signals. We show that in the strong test of removing sentiment, the statistical association between the processed text and sentiment is still detectable to machine learning classifiers post-LLM-disentanglement. Furthermore, we find that human annotators also struggle to disentangle sentiment while preserving other semantic content. This suggests there may be limited separability between concept variables in some text contexts, highlighting limitations of methods relying on text-level transformations and also raising questions about the robustness of disentanglement methods that achieve statistical independence in representation space.
Setting the AI Agenda – Evidence from Sweden in the ChatGPT Era

Bastiaan Bruinsma, Annika Fredén, Kajsa Hansson, Moa Johansson, Pasko Kisić-Merino, and Denitsa Saynova

In Proceedings of AEQUITAS 2024: Workshop on Fairness and Bias in AI, Nov 2024

This paper examines the development of the Artificial Intelligence (AI) meta-debate in Sweden before and after the release of ChatGPT. From the perspective of agenda-setting theory, we propose that it is an elite outside of party politics that is leading the debate – i.e. that the politicians are relatively silent when it comes to this rapid development. We also suggest that the debate has become more substantive and risk-oriented in recent years. To investigate this claim, we draw on an original dataset of elite-level documents from the early 2010s to the present, using op-eds published in a number of leading Swedish newspapers. By conducting a qualitative content analysis of these materials, our preliminary findings lend support to the expectation that an academic, rather than a political elite is steering the debate.
Word embeddings on ideology and issues from Swedish parliamentarians’ motions: a comparative approach

Annika Fredén, Moa Johansson, and Denitsa Saynova

Journal of Elections, Public Opinion and Parties, Nov 2024

What Happens to a Dataset Transformed by a Projection-based Concept Removal Method?

Richard Johansson

In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, Nov 2024

We investigate the behavior of methods that use linear projections to remove information about a concept from a language representation, and we consider the question of what happens to a dataset transformed by such a method. A theoretical analysis and experiments on real-world and synthetic data show that these methods inject strong statistical dependencies into the transformed datasets. After applying such a method, the representation space is highly structured: in the transformed space, an instance tends to be located near instances of the opposite label. As a consequence, the original labeling can in some cases be reconstructed by applying an anti-clustering method.

2023

The Effect of Scaling, Retrieval Augmentation and Form on the Factual Consistency of Language Models

Lovisa Hagström, Denitsa Saynova, Tobias Norlund, Moa Johansson, and Richard Johansson

In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Dec 2023

Large Language Models (LLMs) make natural interfaces to factual knowledge, but their usefulness is limited by their tendency to deliver inconsistent answers to semantically equivalent questions. For example, a model might supply the answer “Edinburgh” to “Anne Redpath passed away in X.” and “London” to “Anne Redpath’s life ended in X.” In this work, we identify potential causes of inconsistency and evaluate the effectiveness of two mitigation strategies: up-scaling and augmenting the LM with a passage retrieval database. Our results on the LLaMA and Atlas models show that both strategies reduce inconsistency but that retrieval augmentation is considerably more efficient. We further consider and disentangle the consistency contributions of different components of Atlas. For all LMs evaluated we find that syntactical form and task artifacts impact consistency. Taken together, our results provide a better understanding of the factors affecting the factual consistency of language models.
Surface-Based Retrieval Reduces Perplexity of Retrieval-Augmented Language Models

Ehsan Doostmohammadi, Tobias Norlund, Marco Kuhlmann, and Richard Johansson

In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Jul 2023

Augmenting language models with a retrieval mechanism has been shown to significantly improve their performance while keeping the number of parameters low. Retrieval-augmented models commonly rely on a semantic retrieval mechanism based on the similarity between dense representations of the query chunk and potential neighbors. In this paper, we study the state-of-the-art Retro model and observe that its performance gain is better explained by surface-level similarities, such as token overlap. Inspired by this, we replace the semantic retrieval in Retro with a surface-level method based on BM25, obtaining a significant reduction in perplexity. As full BM25 retrieval can be computationally costly for large datasets, we also apply it in a re-ranking scenario, gaining part of the perplexity reduction with minimal computational overhead.
Sudden Semantic Shifts in Swedish NATO discourse

Brian Bonafilia, Bastiaan Bruinsma, Denitsa Saynova, and Moa Johansson

In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), Jul 2023

In this paper, we investigate a type of semantic shift that occurs when a sudden event radically changes public opinion on a topic. Looking at Sweden’s decision to apply for NATO membership in 2022, we use word embeddings to study how the associations users on Twitter have regarding NATO evolve. We identify several changes that we successfully validate against real-world events. However, the low engagement of the public with the issue often made it challenging to distinguish true signals from noise. We thus find that domain knowledge and data selection are of prime importance when using word embeddings to study semantic shifts.
On the Generalization Ability of Retrieval-Enhanced Transformers

Tobias Norlund, Ehsan Doostmohammadi, Richard Johansson, and Marco Kuhlmann

In Findings of the Association for Computational Linguistics: EACL 2023, May 2023

Recent work on the Retrieval-Enhanced Transformer (RETRO) model has shown impressive results: off-loading memory from trainable weights to a retrieval database can significantly improve language modeling and match the performance of non-retrieval models that are an order of magnitude larger in size. It has been suggested that at least some of this performance gain is due to non-trivial generalization based on both model weights and retrieval. In this paper, we try to better understand the relative contributions of these two components. We find that the performance gains from retrieval to a very large extent originate from overlapping tokens between the database and the test data, suggesting less of non-trivial generalization than previously assumed. More generally, our results point to the challenges of evaluating the generalization of retrieval-augmented language models such as RETRO, as even limited token overlap may significantly decrease test-time loss. We release our code and model at https://github.com/TobiasNorlund/retro
An Empirical Study of Multitask Learning to Improve Open Domain Dialogue Systems

Mehrdad Farahani and Richard Johansson

In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), May 2023

Autoregressive models used to generate responses in open-domain dialogue systems often struggle to take long-term context into account and to maintain consistency over a dialogue. Previous research in open-domain dialogue generation has shown that the use of auxiliary tasks can introduce inductive biases that encourage the model to improve these qualities. However, most previous research has focused on encoder-only or encoder/decoder models, while the use of auxiliary tasks in decoder-only autoregressive models is under-explored. This paper describes an investigation where four different auxiliary tasks are added to small and medium-sized GPT-2 models fine-tuned on the PersonaChat and DailyDialog datasets. The results show that the introduction of the new auxiliary tasks leads to small but consistent improvement in evaluations of the investigated models.
Class Explanations: the Role of Domain-Specific Content and Stop Words

Denitsa Saynova, Bastiaan Bruinsma, Moa Johansson, and Richard Johansson

In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), May 2023

We address two understudied areas related to explainability for neural text models. First, class explanations. What features are descriptive across a class, rather than explaining single input instances? Second, the type of features that are used for providing explanations. Does the explanation involve the statistical pattern of word usage or the presence of domain-specific content words? Here, we present a method to extract both class explanations and strategies to differentiate between two types of explanations – domain-specific signals or statistical variations in frequencies of common words. We demonstrate our method using a case study in which we analyse transcripts of political debates in the Swedish Riksdag.

2022

Controlling for Stereotypes in Multimodal Language Model Evaluation

Manuj Malik and Richard Johansson

In Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, Sep 2022
Coveting Your Neighbor’s Wife: Using Lexical Neighborhoods in Substitution-based Word Sense Disambiguation

Richard Johansson

In LIVE and LEARN – Festschrift in honor of Lars Borin, Sep 2022

Cross-modal Transfer Between Vision and Language for Protest Detection

Ria Raj, Kajsa Andéasson, Tobias Norlund, Richard Johansson, and Aron Lagerberg

In Proceedings of the 5th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE), Sep 2022

Can We Use Small Models to Investigate Multimodal Fusion Methods?

Lovisa Hagström, Tobias Norlund, and Richard Johansson

In Proceedings of the 2022 CLASP Conference on (Dis)embodiment, Sep 2022

Conceptualizing Treatment Leakage in Text-based Causal Inference

Adel Daoud, Connor Jerzak, and Richard Johansson

In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Jul 2022

What do Models Learn From Training on More Than Text? Measuring Visual Commonsense Knowledge

Lovisa Hagström and Richard Johansson

In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, May 2022

How to Adapt Pre-trained Vision-and-Language Models to a Text-only Input?

Lovisa Hagström and Richard Johansson

In Proceedings of the 29th International Conference on Computational Linguistics (COLING), May 2022

Semi-supervised Learning with Natural Language Processing for Right Ventricle Classification in Echocardiography – a Scalable Approach

Eva Hagberg, David Hagerman, Richard Johansson, Nasser Hosseini, Jan Liu, Elin Björnsson, Jennifer Alvén, and Ola Hjelmgren

Computers in Biology and Medicine, May 2022


2021

Transferring Knowledge from Vision to Language: How to Achieve it and how to Measure it?

Tobias Norlund, Lovisa Hagström, and Richard Johansson

In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, May 2021

Knowledge Distillation for Swedish NER models: A Search for Performance and Efficiency

Lovisa Hagström and Richard Johansson

In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021), May 2021
