Barbara McGillivray - ACL Anthology

Barbara McGillivray

Also published as: Barbara Mcgillivray

2026

Sense-Based Annotation of Geographical Nouns in Ancient Greek and Latin: A Diachronic Study with LLMs
Andrea Farina | Michele Ciletti | Barbara Mcgillivray | Andrea Ballatore
Proceedings of the 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2026

This paper investigates the lexicalisation of geographical nouns in Latin and Ancient Greek using a diachronic, multi-genre corpus (8th cent. BCE – 2nd cent. CE) and Large Language Models for Word Sense Disambiguation. We focus on two main aspects: the onomasiological question of which words encode core geographical concepts, and the semasiological distribution of senses across lemmas. Across both languages, city-related concepts are the most frequently expressed, but Greek shows a stronger focus on maritime terms, whereas Latin favours concepts related to land. Semasiologically, Latin shows clearer evidence of semantic change over time (e.g., ’citizenship’ - ’city’, aequor ’flat surface’ - ’sea’), while Greek displays more gradual or distributed shifts. These results show that computational annotation enables cross-linguistic and diachronic analysis of spatial semantics, allowing us to compare the frequency of concepts across languages, genres, and periods, and to track when semantic change occurs and how core concepts evolve over time.

2025

An Annotation Protocol for Diachronic Evaluation of Semantic Drift in Disability Sources
Nitisha Jain | Chiara Di Bonaventura | Albert Merono Penuela | Barbara McGillivray
Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025)

Annotating terms referring to aspects of disability in historical texts is crucial for understanding how societies in different periods conceptualized and treated disability. Such annotations help modern readers grasp the evolving language, cultural attitudes, and social structures surrounding disability, shedding light on both marginalization and inclusion throughout history. This is important as evolving societal attitudes can influence the perpetuation of harmful language that reinforces stereotypes and discrimination. However, this task presents significant challenges. Terminology often reflects outdated, offensive, or ambiguous concepts that require sensitive interpretation. The meaning of terms may have shifted over time, making it difficult to align historical terms with contemporary understandings of disability. Additionally, contextual nuances and the lack of standardized language in historical records demand careful scholarly judgment to avoid anachronism or misrepresentation.

From Detection to Explanation: Effective Learning Strategies for LLMs in Online Abusive Language Research
Chiara Di Bonaventura | Lucia Siciliani | Pierpaolo Basile | Albert Merono Penuela | Barbara McGillivray
Proceedings of the 31st International Conference on Computational Linguistics

Abusive language detection relies on understanding different levels of intensity, expressiveness and targeted groups, which requires commonsense reasoning, world knowledge and linguistic nuances that evolve over time. Here, we frame the problem as a knowledge-guided learning task, and demonstrate that LLMs’ implicit knowledge without an accurate strategy is not suitable for multi-class detection nor explanation generation. We publicly release GLlama Alarm, the knowledge-Guided version of Llama-2 instruction fine-tuned for multi-class abusive language detection and explanation generation. By being fine-tuned on structured explanations and external reliable knowledge sources, our model mitigates bias and generates explanations that are relevant to the text and coherent with human reasoning, with an average 48.76% better alignment with human judgment according to our expert survey.

Hatevolution: What Static Benchmarks Don’t Tell Us
Chiara Di Bonaventura | Barbara McGillivray | Yulan He | Albert Meroño-Peñuela
Findings of the Association for Computational Linguistics: ACL 2025

Language changes over time, including in the hate speech domain, which evolves quickly following social dynamics and cultural shifts. While NLP research has investigated the impact of language evolution on model training and has proposed several solutions for it, its impact on model benchmarking remains under-explored. Yet, hate speech benchmarks play a crucial role to ensure model safety. In this paper, we empirically evaluate the robustness of 20 language models across two evolving hate speech experiments, and we show the temporal misalignment between static and time-sensitive evaluations. Our findings call for time-sensitive linguistic benchmarks in order to correctly and reliably evaluate language models in the hate speech domain.

Finding the Paper Behind the Data: Automatic Identification of Research Articles related to Data Publications
Barbara McGillivray | Kaveh Aryan | Viola Harperath | Marton Ribary | Mandy Wigdorowitz
Proceedings of the Third Workshop for Artificial Intelligence for Scientific Publications

Data papers are scholarly publications that describe datasets in detail, including their structure, collection methods, and potential for reuse, typically without presenting new analyses. As data sharing becomes increasingly central to research workflows, linking data papers to relevant research papers is essential for improving transparency, reproducibility, and scholarly credit. However, these links are rarely made explicit in metadata and are often difficult to identify manually at scale. In this study, we present a comprehensive approach to automating the linking process using natural language processing (NLP) techniques. We evaluate both set-based and vector-based methods, including Jaccard similarity, TF-IDF, SBERT, and reranking with large language models. Our experiments on a curated benchmark dataset reveal that no single method consistently outperforms others across all metrics, in line with the multifaceted nature of the task. Set-based methods using frequent words (N=50) achieve the highest top-10% accuracy, closely followed by TF-IDF, which also leads in MRR and top-1% and top-5% accuracy. SBERT-based reranking with LLMs yields the best results in top-N accuracy. This dispersion suggests that different approaches capture complementary aspects of similarity (lexical, semantic, and contextual), showing the value of hybrid strategies for robust matching between data papers and research articles. For several methods, we find no statistically significant difference between using abstracts and full texts, suggesting that abstracts may be sufficient for effective matching. Our findings demonstrate the feasibility of scalable, automated linking between data papers and research articles, enabling more accurate bibliometric analyses, improved tracking of data reuse, and fairer credit assignment for data sharing. This contributes to a more transparent, interconnected, and accessible research ecosystem.

Mapping Meaning in Latin with Large Language Models: A Multi-Task Evaluation of Preverbed Motion Verbs and Spatial Relation Detection in LLMs
Andrea Farina | Andrea Ballatore | Barbara McGillivray
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

2024

Language Pivoting from Parallel Corpora for Word Sense Disambiguation of Historical Languages: A Case Study on Latin
Iacopo Ghinassi | Simone Tedeschi | Paola Marongiu | Roberto Navigli | Barbara McGillivray
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Word Sense Disambiguation (WSD) is an important task in NLP, which serves the purpose of automatically disambiguating a polysemous word with its most likely sense in context. Recent studies have advanced the state of the art in this task, but most of the work has been carried out on contemporary English or other modern languages, leaving challenges posed by low-resource languages and diachronic change open. Although the problem with low-resource languages has recently been mitigated by using existing multilingual resources to propagate otherwise expensive annotations from English to other languages, such techniques have hitherto not been applied to historical languages such as Latin. In this work, we make the following two major contributions. First, we test such a strategy on a historical language and propose a new approach in this framework which makes use of existing bilingual corpora instead of native English datasets. Second, we fine-tune a Latin WSD model on the data produced and achieve state-of-the-art results on a standard benchmark for the task. Finally, we release the dataset generated with our approach, which is the largest dataset for Latin WSD to date. This work opens the door to further research, as our approach can be used for different historical and, generally, under-resourced languages.

LLODIA: A Linguistic Linked Open Data Model for Diachronic Analysis
Florentina Armaselu | Chaya Liebeskind | Paola Marongiu | Barbara McGillivray | Giedre Valunaite Oleskeviciene | Elena-Simona Apostol | Ciprian-Octavian Truica | Daniela Gifu
Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024

This article proposes a linguistic linked open data model for diachronic analysis (LLODIA) that combines data derived from diachronic analysis of multilingual corpora with dictionary-based evidence. A humanities use case was devised as a proof of concept that includes examples in five languages (French, Hebrew, Latin, Lithuanian and Romanian) related to various meanings of the term “revolution” considered at different time intervals. The examples were compiled through diachronic word embedding and dictionary alignment.

Is Explanation All You Need? An Expert Survey on LLM-generated Explanations for Abusive Language Detection
Chiara Di Bonaventura | Lucia Siciliani | Pierpaolo Basile | Albert Merono Penuela | Barbara Mcgillivray
Proceedings of the Tenth Italian Conference on Computational Linguistics (CLiC-it 2024)

Explainable abusive language detection has proven to help both users and content moderators, and recent research has focused on prompting LLMs to generate explanations for why a specific text is hateful. Yet, understanding the alignment of these generated explanations with human expectations and judgements is far from being solved. In this paper, we design a before-and-after study recruiting AI experts to evaluate the usefulness and trustworthiness of LLM-generated explanations for abusive language detection tasks, investigating multiple LLMs and learning strategies. Our experiments show that expectations in terms of usefulness and trustworthiness of LLM-generated explanations are not met, as their ratings decrease by 47.78% and 64.32%, respectively, after treatment. Further, our results suggest caution in using LLMs for explanation generation of abusive language detection due to (i) their cultural bias, and (ii) difficulty in reliably evaluating them with empirical metrics. In light of our results, we provide three recommendations to use LLMs responsibly for explainable abusive language detection.

2023

Towards a Conversational Web? A Benchmark for Analysing Semantic Change with Conversational Knowledge Bots and Linked Open Data
Florentina Armaselu | Elena-Simona Apostol | Christian Chiarcos | Anas Fahad Khan | Chaya Liebeskind | Barbara McGillivray | Ciprian-Octavian Truica | Andrius Utka | Giedrė Valūnaitė-Oleškevičienė
Proceedings of the 4th Conference on Language, Data and Knowledge

Graph Databases for Diachronic Language Data Modelling
Barbara McGillivray | Pierluigi Cassotti | Davide Di Pierro | Paola Marongiu | Anas Fahad Khan | Stefano Ferilli | Pierpaolo Basile
Proceedings of the 4th Conference on Language, Data and Knowledge

Workflow Reversal and Data Wrangling in Multilingual Diachronic Analysis and Linguistic Linked Open Data Modelling
Florentina Armaselu | Barbara McGillivray | Chaya Liebeskind | Giedrė Valūnaitė Oleškevičienė | Andrius Utka | Daniela Gifu | Anas Fahad Khan | Elena-Simona Apostol | Ciprian-Octavian Truica
Proceedings of the 4th Conference on Language, Data and Knowledge

Evaluation of Distributional Semantic Models of Ancient Greek: Preliminary Results and a Road Map for Future Work
Silvia Stopponi | Nilo Pedrazzini | Saskia Peels | Barbara McGillivray | Malvina Nissim
Proceedings of the Ancient Language Processing Workshop

We evaluate four count-based and predictive distributional semantic models of Ancient Greek against AGREE, a composite benchmark of human judgements, to assess their ability to retrieve semantic relatedness. On the basis of the observations deriving from the analysis of the results, we design a procedure for a larger-scale intrinsic evaluation of count-based and predictive language models, including syntactic embeddings. We also propose possible ways of exploiting the different layers of the whole AGREE benchmark (including both human- and machine-generated data) and different evaluation metrics.

2022

Machines in the media: semantic change in the lexicon of mechanization in 19th-century British newspapers
Nilo Pedrazzini | Barbara McGillivray
Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities

The industrialization process associated with the so-called Industrial Revolution in 19th-century Great Britain was a time of profound changes, including in the English lexicon. An important yet understudied phenomenon is the semantic shift in the lexicon of mechanization. In this paper we present the first large-scale analysis of terms related to mechanization over the course of the 19th century in English. We draw on a corpus of historical British newspapers comprising 4.6 billion tokens and train historical word embedding models. We test existing semantic change detection techniques and analyse the results in light of previous historical linguistic scholarship.

Leveraging time-dependent lexical features for offensive language detection
Barbara McGillivray | Malithi Alahapperuma | Jonathan Cook | Chiara Di Bonaventura | Albert Meroño-Peñuela | Gareth Tyson | Steven Wilson
Proceedings of the First Workshop on Ever Evolving NLP (EvoNLP)

We present a study on the integration of time-sensitive information in lexicon-based offensive language detection systems. Our focus is on Offenseval sub-task A, aimed at detecting offensive tweets. We apply a semantic change detection algorithm over a short time span of two years to detect words whose semantics has changed and we focus particularly on those words that acquired or lost an offensive meaning between 2019 and 2020. Using the output of this semantic change detection approach, we train an SVM classifier on the Offenseval 2019 training set. We build on the already competitive SINAI system submitted to Offenseval 2019 by adding new lexical features, including those that capture the change in usage of words and their association with emerging offensive usages. We discuss the challenges, opportunities and limitations of integrating semantic change detection in offensive language detection models. Our work draws attention to an often neglected aspect of offensive language, namely that the meanings of words are constantly evolving and that NLP systems that account for this change can achieve good performance even when not trained on the most recent training data.

2021

When Time Makes Sense: A Historically-Aware Approach to Targeted Sense Disambiguation
Kaspar Beelen | Federico Nanni | Mariona Coll Ardanuy | Kasra Hosseini | Giorgia Tolfo | Barbara McGillivray
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages
Dominik Schlechtweg | Nina Tahmasebi | Simon Hengchen | Haim Dubossarsky | Barbara McGillivray
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Word meaning is notoriously difficult to capture, both synchronically and diachronically. In this paper, we describe the creation of the largest resource of graded contextualized, diachronic word meaning annotation in four different languages, based on 100,000 human semantic proximity judgments. We describe in detail the multi-round incremental annotation process, the choice for a clustering algorithm to group usages into senses, and possible – diachronic and synchronic – uses for this dataset.

2020

Embedding Structured Dictionary Entries
Steven Wilson | Walid Magdy | Barbara McGillivray | Gareth Tyson
Proceedings of the First Workshop on Insights from Negative Results in NLP

Previous work has shown how to effectively use external resources such as dictionaries to improve English-language word embeddings, either by manipulating the training process or by applying post-hoc adjustments to the embedding space. We experiment with a multi-task learning approach for explicitly incorporating the structured elements of dictionary entries, such as user-assigned tags and usage examples, when learning embeddings for dictionary headwords. Our work generalizes several existing models for learning word embeddings from dictionaries. However, we find that the most effective representations overall are learned by simply training with a skip-gram objective over the concatenated text of all entries in the dictionary, giving no particular focus to the structure of the entries.

Living Machines: A study of atypical animacy
Mariona Coll Ardanuy | Federico Nanni | Kaspar Beelen | Kasra Hosseini | Ruth Ahnert | Jon Lawrence | Katherine McDonough | Giorgia Tolfo | Daniel CS Wilson | Barbara McGillivray
Proceedings of the 28th International Conference on Computational Linguistics

This paper proposes a new approach to animacy detection, the task of determining whether an entity is represented as animate in a text. In particular, this work is focused on atypical animacy and examines the scenario in which typically inanimate objects, specifically machines, are given animate attributes. To address it, we have created the first dataset for atypical animacy detection, based on nineteenth-century sentences in English, with machines represented as either animate or inanimate. Our method builds on recent innovations in language modeling, specifically BERT contextualized word embeddings, to better capture fine-grained contextual properties of words. We present a fully unsupervised pipeline, which can be easily adapted to different contexts, and report its performance on an established animacy dataset and our newly introduced resource. We show that our method provides a substantially more accurate characterization of atypical animacy, especially when applied to highly complex forms of language use.

SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection
Dominik Schlechtweg | Barbara McGillivray | Simon Hengchen | Haim Dubossarsky | Nina Tahmasebi
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Lexical Semantic Change detection, i.e., the task of identifying words that change meaning over time, is a very active research area, with applications in NLP, lexicography, and linguistics. Evaluation is currently the most pressing problem in Lexical Semantic Change detection, as no gold standards are available to the community, which hinders progress. We present the results of the first shared task that addresses this gap by providing researchers with an evaluation framework and manually annotated, high-quality datasets for English, German, Latin, and Swedish. 33 teams submitted 186 systems, which were evaluated on two subtasks.

Urban Dictionary Embeddings for Slang NLP Applications
Steven Wilson | Walid Magdy | Barbara McGillivray | Kiran Garimella | Gareth Tyson
Proceedings of the Twelfth Language Resources and Evaluation Conference

The choice of the corpus on which word embeddings are trained can have a sizable effect on the learned representations, the types of analyses that can be performed with them, and their utility as features for machine learning models. To contribute to the existing sets of pre-trained word embeddings, we introduce and release the first set of word embeddings trained on the content of Urban Dictionary, a crowd-sourced dictionary for slang words and phrases. We show that although these embeddings are trained on fewer total tokens (by at least an order of magnitude compared to most popular pre-trained embeddings), they have high performance across a range of common word embedding evaluations, ranging from semantic similarity to word clustering tasks. Further, for some extrinsic tasks such as sentiment analysis and sarcasm detection where we expect to require some knowledge of colloquial language on social media data, initializing classifiers with the Urban Dictionary Embeddings resulted in improved performance compared to initializing with a range of other well-known, pre-trained embeddings that are an order of magnitude larger in size.

2019

Vector space models of Ancient Greek word meaning, and a case study on Homer
Martina Astrid Rodda | Philomen Probert | Barbara McGillivray
Traitement Automatique des Langues, Volume 60, Numéro 3 : TAL et humanités numériques [NLP and Digital Humanities]

Room to Glo: A Systematic Comparison of Semantic Change Detection Approaches with Word Embeddings
Philippa Shoemark | Farhana Ferdousi Liza | Dong Nguyen | Scott A. Hale | Barbara McGillivray
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Word embeddings are increasingly used for the automatic detection of semantic change; yet, a robust evaluation and systematic comparison of the choices involved has been lacking. We propose a new evaluation framework for semantic change detection and find that (i) using the whole time series is preferable over only comparing between the first and last time points; (ii) independently trained and aligned embeddings perform better than continuously trained embeddings for long time periods; and (iii) that the reference point for comparison matters. We also present an analysis of the changes detected on a large Twitter dataset spanning 5.5 years.

GASC: Genre-Aware Semantic Change for Ancient Greek
Valerio Perrone | Marco Palma | Simon Hengchen | Alessandro Vatri | Jim Q. Smith | Barbara McGillivray
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

Word meaning changes over time, depending on linguistic and extra-linguistic factors. Associating a word’s correct meaning in its historical context is a central challenge in diachronic research, and is relevant to a range of NLP tasks, including information retrieval and semantic search in historical texts. Bayesian models for semantic change have emerged as a powerful tool to address this challenge, providing explicit and interpretable representations of semantic change phenomena. However, while corpora typically come with rich metadata, existing models are limited by their inability to exploit contextual information (such as text genre) beyond the document time-stamp. This is particularly critical in the case of ancient languages, where lack of data and long diachronic span make it harder to draw a clear distinction between polysemy (the fact that a word has several senses) and semantic change (the process of acquiring, losing, or changing senses), and current systems perform poorly on these languages. We develop GASC, a dynamic semantic change model that leverages categorical metadata about the texts’ genre to boost inference and uncover the evolution of meanings in Ancient Greek corpora. In a new evaluation framework, our model achieves improved predictive performance compared to the state of the art.

Mining the UK Web Archive for Semantic Change Detection
Adam Tsakalidis | Marya Bazzi | Mihai Cucuringu | Pierpaolo Basile | Barbara McGillivray
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Semantic change detection (i.e., identifying words whose meaning has changed over time) started emerging as a growing area of research over the past decade, with important downstream applications in natural language processing, historical linguistics and computational social science. However, several obstacles make progress in the domain slow and difficult. These pertain primarily to the lack of well-established gold standard datasets, resources to study the problem at a fine-grained temporal resolution, and quantitative evaluation approaches. In this work, we aim to mitigate these issues by (a) releasing a new labelled dataset of more than 47K word vectors trained on the UK Web Archive over a short time-frame (2000-2013); (b) proposing a variant of Procrustes alignment to detect words that have undergone semantic shift; and (c) introducing a rank-based approach for evaluation purposes. Through extensive numerical experiments and validation, we illustrate the effectiveness of our approach against competitive baselines. Finally, we also make our resources publicly available to further enable research in the domain.

2010

Automatic Selectional Preference Acquisition for Latin Verbs
Barbara McGillivray
Proceedings of the ACL 2010 Student Research Workshop

2009

The Development of the “Index Thomisticus” Treebank Valency Lexicon
Barbara McGillivray | Marco Passarotti
Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH – SHELT&R 2009)

The Index Thomisticus Treebank Project: Annotation, Parsing and Valency Lexicon
Barbara McGillivray | Marco Passarotti | Paolo Ruffolo
Traitement Automatique des Langues, Volume 50, Numéro 2 : Langues anciennes [Ancient Languages]

2008

Unsupervised Acquisition of Verb Subcategorization Frames from Shallow-Parsed Corpora
Alessandro Lenci | Barbara McGillivray | Simonetta Montemagni | Vito Pirrelli
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper, we reported experiments of unsupervised automatic acquisition of Italian and English verb subcategorization frames (SCFs) from general and domain corpora. The proposed technique operates on syntactically shallow-parsed corpora on the basis of a limited number of search heuristics not relying on any previous lexico-syntactic knowledge about SCFs. Although preliminary, reported results are in line with state-of-the-art lexical acquisition systems. The issue of whether verbs sharing similar SCFs distributions happen to share similar semantic properties as well was also explored by clustering verbs that share frames with the same distribution using the Minimum Description Length Principle (MDL). First experiments in this direction were carried out on Italian verbs with encouraging results.

Semantic Structure from Correspondence Analysis
Barbara McGillivray | Christer Johansson | Daniel Apollon
Coling 2008: Proceedings of the 3rd Textgraphs workshop on Graph-based Algorithms for Natural Language Processing

Co-authors

Simon Hengchen 3

Chaya Liebeskind 3

Paola Marongiu 3

Ciprian-Octavian Truică 3

Giedrė Valūnaitė-Oleškevičienė 3

Steven Wilson 3

Andrea Ballatore 2

Kaspar Beelen 2

Mariona Coll Ardanuy 2

Haim Dubossarsky 2

Andrea Farina 2

Kasra Hosseini 2

Federico Nanni 2

Marco Passarotti 2

Nilo Pedrazzini 2

Dominik Schlechtweg 2

Lucia Siciliani 2

Nina Tahmasebi 2

Giorgia Tolfo 2

Malithi Alahapperuma 1

Daniel Apollon 1

Pierluigi Cassotti 1

Christian Chiarcos 1

Michele Ciletti 1

Jonathan Cook 1

Mihai Cucuringu 1

Davide Di Pierro 1

Stefano Ferilli 1

Kiran Garimella 1

Iacopo Ghinassi 1

Scott A. Hale 1

Viola Harperath 1

Christer Johansson 1

Alessandro Lenci 1

Farhana Ferdousi Liza 1

Katherine McDonough 1

Simonetta Montemagni 1

Roberto Navigli 1

Malvina Nissim 1

Valerio Perrone 1

Vito Pirrelli 1

Philomen Probert 1

Marton Ribary 1

Martina Astrid Rodda 1

Paolo Ruffolo 1

Philippa Shoemark 1

Silvia Stopponi 1

Simone Tedeschi 1

Adam Tsakalidis 1

Alessandro Vatri 1

Mandy Wigdorowitz 1

Daniel CS Wilson 1

Venues