Workshop on Natural Language Processing for Digital Humanities (2024)

Volumes

Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities 54 papers

pdf (full)
bib (full) Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
Mika Hämäläinen | Emily Öhman | So Miyagawa | Khalid Alnajjar | Yuri Bizzoni

pdf bib abs

Text Length and the Function of Intentionality: A Case Study of Contrastive Subreddits
Emily Sofi Ohman | Aatu Liimatta

Text length is of central concern in natural language processing (NLP) tasks, yet it is very much under-researched. In this paper, we use social media data, specifically Reddit, to explore the function of text length and intentionality by contrasting subreddits of the same topic where one is considered more serious/professional/academic and the other more relaxed/beginner/layperson. We hypothesize that word choices are more deliberate and intentional in the more in-depth and professional subreddits with texts subsequently becoming longer as a function of this intentionality. We argue that this has deep implications for many applied NLP tasks such as emotion and sentiment analysis, fake news and disinformation detection, and other modeling tasks focused on social media and similar platforms where users interact with each other via the medium of text.

pdf bib abs

Tracing the Genealogies of Ideas with Sentence Embeddings
Lucian Li

Detecting intellectual influence in unstructured text is an important problem for a wide range of fields, including intellectual history, social science, and bibliometrics. A wide range of previous studies in computational social science and digital humanities have attempted to resolve this through a range of dictionary, embedding, and language model based methods. I introduce an approach which leverages a sentence embedding index to efficiently search for similar ideas in a large historical corpus. This method remains robust in conditions of high OCR error found in real mass digitized historical corpora that disrupt previous published methods, while also capturing paraphrase and indirect influence. I evaluate this method on a large corpus of 250,000 nonfiction texts from the 19th century, and find that discovered influence is in line with history of science literature. By expanding the scope of our search for influence and the origins of ideas beyond traditional structured corpora and canonical works and figures, we can get a more nuanced perspective on influence and idea dissemination that can encompass epistemically marginalized groups.

pdf bib abs

Evaluating Computational Representations of Character: An Austen Character Similarity Benchmark
Funing Yang | Carolyn Jane Anderson

Several systems have been developed to extract information about characters to aid computational analysis of English literature. We propose character similarity grouping as a holistic evaluation task for these pipelines. We present AustenAlike, a benchmark suite of character similarities in Jane Austen’s novels. Our benchmark draws on three notions of character similarity: a structurally defined notion of similarity; a socially defined notion of similarity; and an expert defined set extracted from literary criticism. We use AustenAlike to evaluate character features extracted using two pipelines, BookNLP and FanfictionNLP. We build character representations from four kinds of features and compare them to the three AustenAlike benchmarks and to GPT-4 similarity rankings. We find that though computational representations capture some broad similarities based on shared social and narrative roles, the expert pairings in our third benchmark are challenging for all systems, highlighting the subtler aspects of similarity noted by human readers.

pdf bib abs

Investigating Expert-in-the-Loop LLM Discourse Patterns for Ancient Intertextual Analysis
Ray Umphrey | Jesse Roberts | Lindsey Roberts

This study explores the potential of large language models (LLMs) for identifying and examining intertextual relationships within biblical, koine Greek texts. By evaluating the performance of LLMs on various intertextuality scenarios the study demonstrates that these models can detect direct quotations, allusions, and echoes between texts. The LLM’s ability to generate novel intertextual observations and connections highlights its potential to uncover new insights. However, the model also struggles with long query passages and the inclusion of false intertextual dependences, emphasizing the importance of expert evaluation. The expert-in-the-loop methodology presented offers a scalable approach for intertextual research into the complex web of intertextuality within and beyond the biblical corpus.

pdf bib abs

Extracting Relations from Ecclesiastical Cultural Heritage Texts
Giulia Cruciani

Motivated by the increasing volume of data and the necessity of getting valuable insights, this research describes the process of extracting entities and relations from Italian texts in the context of ecclesiastical cultural heritage data. Named Entity Recognition (NER) and Relation Extraction (RE) are paramount tasks in Natural Language Processing. This paper presents a traditional methodology based on a two-step procedure: firstly, a custom model for Named Entity Recognition extracts entities from data, and then, a multi-input neural network model is trained to perform Relation Classification as a multi-label classification problem. Data are provided by IDS&Unitelm (technological partner of the IT Services and National Office for Ecclesiastical Cultural Heritage and Religious Buildings of CEI, the Italian Episcopal Conference) and concerns biographical texts of 9,982 entities of type person, which can be accessed by the online portal BeWeb. This approach aims to enhance the organization and accessibility of ecclesiastical cultural heritage data, offering deeper insights into historical biographical records.

pdf bib abs

Constructing a Sentiment-Annotated Corpus of Austrian Historical Newspapers: Challenges, Tools, and Annotator Experience
Lucija Krusic

This study presents the development of a sentiment-annotated corpus of historical newspaper texts in Austrian German, addressing a gap in annotated corpora for Natural Language Processing in the field of Digital Humanities. Three annotators categorised 1005 sentences from two 19th-century periodicals into four sentiment categories: positive, negative, neutral, and mixed. The annotators, Masters and PhD students in Linguistics and Digital Humanities, are considered semi-experts and have received substantial training during this annotation study. Three tools were used and compared in the annotation process: Google Sheets, Google Forms and Doccano, and resulted in a gold standard corpus. The analysis revealed a fair to moderate inter-rater agreement (Fleiss’ kappa = 0.405) and an average percentage agreement of 45.7% for full consensus and 92.5% for majority vote. As majority vote is needed for the creation of a gold standard corpus, these results are considered sufficient, and the annotations reliable. The study also introduced comprehensive guidelines for sentiment annotation, which were essential to overcome the challenges posed by historical language and context. The annotators’ experience was assessed through a combination of standardised usability tests (NASA-TLX and UEQ-S) and a detailed custom-made user experience questionnaire, which provided qualitative insights into the difficulties and usability of the tools used. The questionnaire is an additional resource that can be used to assess usability and user experience assessments in future annotation studies. The findings demonstrate the effectiveness of semi-expert annotators and dedicated tools in producing reliable annotations and provide valuable resources, including the annotated corpus, guidelines, and a user experience questionnaire, for future sentiment analysis and annotation of Austrian historical texts. The sentiment-annotated corpus will be used as the gold standard for fine-tuning and evaluating machine learning models for sentiment analysis of Austrian historical newspapers with the topic of migration and minorities in a subsequent study.

pdf bib abs

It is a Truth Individually Acknowledged: Cross-references On Demand
Piper Vasicek | Courtni Byun | Kevin Seppi

Cross-references link source passages of text to other passages that elucidate the source passage in some way and can deepen human understanding. Despite their usefulness, however, good cross-references are hard to find, and extensive sets of cross-references only exist for the few most highly studied books such as the Bible, for which scholars have been collecting cross-references for hundreds of years. Therefore, we propose a new task: generate cross-references for user-selected text on demand. We define a metric, coverage, to evaluate task performance. We adapt several models to generate cross references, including an Anchor Words topic model, SBERT SentenceTransformers, and ChatGPT, and evaluate their coverage in both English and German on existing cross-reference datasets. While ChatGPT outperforms other models on these datasets, this is likely due to data contamination. We hand-evaluate performance on the well-known works of Jane Austen and a less-known science fiction series Sons of the Starfarers by Joe Vasicek, finding that ChatGPT does not perform as well on these works; sentence embeddings perform best. We experiment with newer LLMs and large context windows, and suggest that future work should focus on deploying cross-references on-demand with readers to determine their effectiveness in the wild.

pdf bib abs

Extracting position titles from unstructured historical job advertisements
Klara Venglarova | Raven Adam | Georg Vogeler

This paper explores the automated extraction of job titles from unstructured historical job advertisements, using a corpus of digitized German-language newspapers from 1850-1950. The study addresses the challenges of working with unstructured, OCR-processed historical data, contrasting with contemporary approaches that often use structured, digitally-born datasets when dealing with this text type. We compare four extraction methods: a dictionary-based approach, a rule-based approach, a named entity recognition (NER) mode, and a text-generation method. The NER approach, trained on manually annotated data, achieved the highest F1 score (0.944 using transformers model trained on GPU, 0.884 model trained on CPU), demonstrating its flexibility and ability to correctly identify job titles. The text-generation approach performs similarly (0.920). However, the rule-based (0.69) and dictionary-based (0.632) methods reach relatively high F1 Scores as well, while offering the advantage of not requiring extensive labeling of training data. The results highlight the complexities of extracting meaningful job titles from historical texts, with implications for further research into labor market trends and occupational history.

pdf bib abs

Language Resources From Prominent Born-Digital Humanities Texts are Still Needed in the Age of LLMs
Natalie Hervieux | Peiran Yao | Susan Brown | Denilson Barbosa

The digital humanities (DH) community fundamentally embraces the use of computerized tools for the study and creation of knowledge related to language, history, culture, and human values, in which natural language plays a prominent role. Many successful DH tools rely heavily on Natural Language Processing methods, and several efforts exist within the DH community to promote the use of newer and better tools. Nevertheless, most NLP research is driven by web corpora that are noticeably different from texts commonly found in DH artifacts, which tend to use richer language and refer to rarer entities. Thus, the near-human performance achieved by state-of-the-art NLP tools on web texts might not be achievable on DH texts. We introduce a dataset carefully created by computer scientists and digital humanists intended to serve as a reference point for the development and evaluation of NLP tools. The dataset is a subset of a born-digital textbase resulting from a prominent and ongoing experiment in digital literary history, containing thousands of multi-sentence excerpts that are suited for information extraction tasks. We fully describe the dataset and show that its language is demonstrably different than the corpora normally used in training language resources in the NLP community.

pdf bib abs

NLP for Digital Humanities: Processing Chronological Text Corpora
Adam Pawłowski | Tomasz Walkowiak

The paper focuses on the integration of Natural Language Processing (NLP) techniques to analyze extensive chronological text corpora. This research underscores the synergy between humanistic inquiry and computational methods, especially in the processing and analysis of sequential textual data known as lexical series. A reference workflow for chronological corpus analysis is introduced, outlining the methodologies applicable to the ChronoPress corpus, a data set that encompasses 22 years of Polish press from 1945 to 1966. The study showcases the potential of this approach in uncovering cultural and historical patterns through the analysis of lexical series. The findings highlight both the challenges and opportunities present in leveraging lexical series analysis within Digital Humanities, emphasizing the necessity for advanced data filtering and anomaly detection algorithms to effectively manage the vast and intricate datasets characteristic of this field.

pdf bib abs

A Multi-task Framework with Enhanced Hierarchical Attention for Sentiment Analysis on Classical Chinese Poetry: Utilizing Information from Short Lines
Quanqi Du | Veronique Hoste

Classical Chinese poetry has a long history, dating back to the 11th century BC. By investigating the sentiment expressed in the poetry, we can gain more insights in the emotional life and history development in ancient Chinese culture. To help improve the sentiment analysis performance in the field of classical Chinese poetry, we propose to utilize the unique information from the individual short lines that compose the poem, and introduce a multi-task framework with hierarchical attention enhanced with short line sentiment labels. Specifically, the multi-task framework comprises sentiment analysis for both the overall poem and the short lines, while the hierarchical attention consists of word- and sentence-level attention, with the latter enhanced with additional information from short line sentiments. Our experimental results showcase that our approach leveraging more fine-grained information from short lines outperforms the state-of-the-art, achieving an accuracy score of 72.88% and an F1-macro score of 71.05%.

pdf bib abs

Exploring Similarity Measures and Intertextuality in Vedic Sanskrit Literature
So Miyagawa | Yuki Kyogoku | Yuzuki Tsukagoshi | Kyoko Amano

This paper examines semantic similarity and intertextuality in selected texts from the Vedic Sanskrit corpus, specifically the Maitrāyaṇī Saṃhitā (MS) and Kāṭhaka-Saṃhitā (KS). Three computational methods are employed: Word2Vec for word embeddings, stylo package for stylometric analysis, and TRACER for text reuse detection. By comparing various sections of the texts at different granularities, patterns of similarity and structural alignment are uncovered, providing insights into textual relationships and chronology. Word embeddings capture semantic similarities, while stylometric analysis reveals clusters and components that differentiate the texts. TRACER identifies parallel passages, indicating probable instances of text reuse. The computational analysis corroborates previous philological studies, suggesting a shared period of composition between MS.1.9 and MS.1.7. This research highlights the potential of computational methods in studying ancient Sanskrit literature, complementing traditional approaches. The agreement among the methods strengthens the validity of the findings, and the visualizations offer a nuanced understanding of textual connections. The study demonstrates that smaller chunk sizes are more effective for detecting intertextual parallels, showcasing the power of these techniques in unraveling the complexities of ancient texts.

pdf bib abs

Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction
Laura Manrique-Gomez | Tony Montes | Arturo Rodriguez Herrera | Ruben Manrique

This paper presents two significant contributions: First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a critical gap in specialized corpora for historical and linguistic analysis in this region. Second, it develops a flexible framework that utilizes a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora. This semi-automated framework is adaptable to various contexts and datasets and is applied to the newly created dataset.

pdf bib abs

We examine the relationship between the canonization of Danish novels and their textual innovation and influence, taking the Danish Modern Breakthrough era (1870–1900) as a case study. We evaluate whether canonical novels introduced a significant textual novelty in their time, and explore their influence on the overall literary trend of the period. By analyzing the positions of canonical versus non-canonical novels in semantic space, we seek to better understand the link between a novel’s canonical status and its literary impact. Additionally, we examine the overall diversification of Modern Breakthrough novels during this significant period of rising literary readership. We find that canonical novels stand out from both the historical novel genre and non-canonical novels of the period. Our findings on diversification within and across groups indicate that the novels now regarded as canonical served as literary trendsetters of their time.

pdf bib abs

Deciphering psycho-social effects of Eating Disorder : Analysis of Reddit Posts using Large Language Model(LLM)s and Topic Modeling
Medini Chopra | Anindita Chatterjee | Lipika Dey | Partha Pratim Das

Eating disorders are a global health concern as they manifest in increasing numbers across all sections of society. Social network platforms have emerged as a dependable source of information about the disease, its effect, and its prevalence among different sections. This work lays the foundation for large-scale analysis of social media data using large language models (LLMs). We show that using LLMs can drastically reduce the time and resource requirements for garnering insights from large data repositories. With respect to ED, this work focuses on understanding its psychological impacts on both patients and those who live in their proximity. Social scientists can utilize the proposed approach to design more focused studies with better representative groups.

pdf bib abs

Topic-Aware Causal Intervention for Counterfactual Detection
Thong Thanh Nguyen | Truc-My Nguyen

Counterfactual statements, which describe events that did not or cannot take place, are beneficial to numerous NLP applications. Hence, we consider the problem of counterfactual detection (CFD) and seek to enhance the CFD models. Previous models are reliant on clue phrases to predict counterfactuality, so they suffer from significant performance drop when clue phrase hints do not exist during testing. Moreover, these models tend to predict non-counterfactuals over counterfactuals. To address these issues, we propose to integrate neural topic model into the CFD model to capture the global semantics of the input statement. We continue to causally intervene the hidden representations of the CFD model to balance the effect of the class labels. Extensive experiments show that our approach outperforms previous state-of-the-art CFD and bias-resolving methods in both the CFD and other bias-sensitive tasks.

pdf bib abs

UD for German Poetry
Stefanie Dipper | Ronja Laarmann-Quante

This article deals with the syntactic analysis of German-language poetry from different centuries. We use Universal Dependencies (UD) as our syntactic framework. We discuss particular challenges of the poems in terms of tokenization, sentence boundary recognition and special syntactic constructions. Our annotated corpus currently consists of 20 poems with a total of 2,162 tokens, which originate from the PoeTree.de corpus. We present some statistics on our annotations and also evaluate the automatic UD annotation from PoeTree.de using our annotations.

pdf bib abs

Molyé: A Corpus-based Approach to Language Contact in Colonial France
Rasul Dent | Juliette Janes | Thibault Clerice | Pedro Ortiz Suarez | Benoît Sagot

Whether or not several Creole languages which developed during the early modern period can be considered genetic descendants of European languages has been the subject of intense debate. This is in large part due to the absence of evidence of intermediate forms. This work introduces a new open corpus, the Molyé corpus, which combines stereotypical representations of three kinds of language variation in Europe with early attestations of French-based Creole languages across a period of 400 years. It is intended to facilitate future research on the continuity between contact situations in Europe and Creolophone (former) colonies.

pdf bib abs

Vector Poetics: Parallel Couplet Detection in Classical Chinese Poetry
Maciej Kurzynski | Xiaotong Xu | Yu Feng

This paper explores computational approaches for detecting parallelism in classical Chinese poetry, a rhetorical device where two verses mirror each other in syntax, meaning, tone, and rhythm. We experiment with five classification methods: (1) verb position matching, (2) integrated semantic, syntactic, and word-segmentation analysis, (3) difference-based character embeddings, (4) structured examples (inner/outer couplets), and (5) GPT-guided classification. We use a manually annotated dataset, containing 6,125 pentasyllabic couplets, to evaluate performance. The results indicate that parallelism detection poses a significant challenge even for powerful LLMs such as GPT-4o, with the highest F1 score below 0.72. Nevertheless, each method contributes valuable insights into the art of parallelism in Chinese poetry, suggesting a new understanding of parallelism as a verbal expression of principal components in a culturally defined vector space.

pdf bib abs

Adapting Measures of Literality for Use with Historical Language Data
Adam Roussel

This paper concerns the adaptation of two existing computational measures relating to the estimation of the literality of expressions to enable their use in scenarios where data is scarce, as is usually the case with historical language data. Being able to determine an expression’s literality via statistical means could support a range of linguistic annotation tasks, such as those relating to metaphor, metonymy, and idiomatic expressions, however making this judgment is especially difficult for modern annotators of historical and ancient texts. Therefore we re-implement these measures using smaller corpora and count-based vectors more suited to these amounts of training data. The adapted measures are evaluated against an existing data set of particle verbs annotated with degrees of literality. The results were inconclusive, yielding low correlations between 0.05 and 0.10 (Spearman’s ρ). Further work is needed to determine which measures and types of data correspond to which aspects of literality.

pdf bib abs

Improving Latin Dependency Parsing by Combining Treebanks and Predictions
Hanna-Mari Kristiina Kupari | Erik Henriksson | Veronika Laippala | Jenna Kanerva

This paper introduces new models designed to improve the morpho-syntactic parsing of the five largest Latin treebanks in the Universal Dependencies (UD) framework. First, using two state-of-the-art parsers, Trankit and Stanza, along with our custom UD tagger, we train new models on the five treebanks both individually and by combining them into novel merged datasets. We also test the models on the CIRCSE test set. In an additional experiment, we evaluate whether this set can be accurately tagged using the novel LASLA corpus (https://github.com/CIRCSE/LASLA). Second, we aim to improve the results by combining the predictions of different models through an atomic morphological feature voting system. The results of our two main experiments demonstrate significant improvements, particularly for the smaller treebanks, with LAS scores increasing by 16.10 and 11.85%-points for UDante and Perseus, respectively (Gamba and Zeman, 2023a). Additionally, the voting system for morphological features (FEATS) brings improvements, especially for the smaller Latin treebanks: Perseus 3.15% and CIRCSE 2.47%-points. Tagging the CIRCSE set with our custom model using the LASLA model improves POS 6.71 and FEATS 11.04%-points respectively, compared to our best-performing UD PROIEL model. Our results show that larger datasets and ensemble predictions can significantly improve performance.

pdf bib abs

From N-grams to Pre-trained Multilingual Models For Language Identification
Thapelo Andrew Sindane | Vukosi Marivate

In this paper, we investigate the use of N-gram models and Large Pre-trained Multilingual models for Language Identification (LID) across 11 South African languages. For N-gram models, this study shows that effective data size selection remains crucial for establishing effective frequency distributions of the target languages, that efficiently model each language, thus, improving language ranking. For pre-trained multilingual models, we conduct extensive experiments covering a diverse set of massively pre-trained multilingual (PLM) models – mBERT, RemBERT, XLM-r, and Afri-centric multilingual models – AfriBERTa, Afro-XLMr, AfroLM, and Serengeti. We further compare these models with available large-scale Language Identification tools: Compact Language Detector v3 (CLD V3), AfroLID, GlotLID, and OpenLID to highlight the importance of focused-based LID. From these, we show that Serengeti is a superior model across models: N-grams to Transformers on average. Moreover, we propose a lightweight BERT-based LID model (za_BERT_lid) trained with NHCLT + Vukzenzele corpus, which performs on par with our best-performing Afri-centric models.

pdf bib abs

Visualising Changes in Semantic Neighbourhoods of English Noun Compounds over Time
Malak Rassem | Myrto Tsigkouli | Chris W. Jenkins | Filip Miletić | Sabine Schulte im Walde

This paper provides a framework and tool set for computing and visualising dynamic, time- specific semantic neighbourhoods of English noun-noun compounds and their constituents over time. Our framework not only identifies salient vector-space dimensions and neighbours in notoriously sparse data: we specifically bring together changes in meaning aspects and degrees of (non-)compositionality.

pdf bib abs

SEFLAG: Systematic Evaluation Framework for NLP Models and Datasets in Latin and Ancient Greek
Konstantin Schulz | Florian Deichsler

Literary scholars of Latin and Ancient Greek increasingly use natural language processing for their work, but many models and datasets are hard to use due to a lack of sustainable research data management. This paper introduces the Systematic Evaluation Framework for natural language processing models and datasets in Latin and Ancient Greek (SEFLAG), which consistently assesses language resources using common criteria, such as specific evaluation metrics, metadata and risk analysis. The framework, a work in progress in its initial phase, currently covers lemmatization and named entity recognition for both languages, with plans for adding dependency parsing and other tasks. For increased transparency and sustainability, a thorough documentation is included as well as an integration into the HuggingFace ecosystem. The combination of these efforts is designed to support researchers in their search for suitable models.

pdf bib abs

A Two-Model Approach for Humour Style Recognition
Mary Ogbuka Kenneth | Foaad Khosmood | Abbas Edalat

Humour, a fundamental aspect of human communication, manifests itself in various styles that significantly impact social interactions and mental health. Recognising different humour styles poses challenges due to the lack of established datasets and machine learning (ML) models. To address this gap, we present a new text dataset for humour style recognition, comprising 1463 instances across four styles (self-enhancing, self-deprecating, affiliative, and aggressive) and non-humorous text, with lengths ranging from 4 to 229 words. Our research employs various computational methods, including classic machine learning classifiers, text embedding models, and DistilBERT, to establish baseline performance. Additionally, we propose a two-model approach to enhance humour style recognition, particularly in distinguishing between affiliative and aggressive styles. Our method demonstrates an 11.61% improvement in f1-score for affiliative humour classification, with consistent improvements in the 14 models tested. Our findings contribute to the computational analysis of humour in text, offering new tools for studying humour in literature, social media, and other textual sources.

pdf bib abs

N-gram-Based Preprocessing for Sandhi Reversion in Vedic Sanskrit
Yuzuki Tsukagoshi | Ikki Ohmukai

This study aims to address the challenges posed by sandhi in Vedic Sanskrit, a phenomenon that complicates the computational analysis of Sanskrit texts. By focusing on sandhi reversion, the research seeks to improve the accuracy of processing Vedic Sanskrit, an older layer of the language. Sandhi, a phonological phenomenon, poses challenges for text processing in Sanskrit due to the fusion of word boundaries or the sound change around word boundaries. In this research, we developed a transformer-based model with a novel n-gram preprocessing strategy to improve the accuracy of sandhi reversion for Vedic. We created character-based n-gram texts of varying lengths (n = 2, 3, 4, 5, 6) from the Rigveda, the oldest Vedic text, and trained models on these texts to perform machine translation from post-sandhi to pre-sandhi forms. In the results, we found that the model trained with 5-gram text achieved the highest accuracy. This success is likely due to the 5-gram’s ability to capture the maximum phonemic context in which Vedic sandhi occurs, making it more effective for the task. These findings suggest that by leveraging the inherent characteristics of phonological changes in language, even simple preprocessing methods like n-gram segmentation can significantly improve the accuracy of complex linguistic tasks.

pdf bib abs

Enhancing Swedish Parliamentary Data: Annotation, Accessibility, and Application in Digital Humanities
Shafqat Mumtaz Virk | Claes Ohlsson | Nina Tahmasebi | Henrik Björck | Leif Runefelt

The Swedish bicameral parliament data presents a valuable textual resource that is of interest for many researches and scholars. The parliamentary texts offer many avenues for research including the study of how various affairs were run by governments over time. The Parliament proceedings are available in textual format, but in their original form, they are noisy and unstructured and thus hard to explore and investigate. In this paper, we report the transformation of the raw bicameral parliament data (1867-1970) into a structured lexical resource annotated with various word and document level attributes. The annotated data is then made searchable through two modern corpus infrastructure components which provide a wide array of corpus exploration, visualization, and comparison options. To demonstrate the practical utility of this resource, we present a case study examining the transformation of the concept of ‘market’ over time from a tangible physical entity to an abstract idea.

pdf bib abs

Evaluating Open-Source LLMs in Low-Resource Languages: Insights from Latvian High School Exams
Roberts Darģis | Guntis Bārzdiņš | Inguna Skadiņa | Normunds Grūzītis | Baiba Saulīte

The latest large language models (LLM) have significantly advanced natural language processing (NLP) capabilities across various tasks. However, their performance in low-resource languages, such as Latvian with 1.5 million native speakers, remains substantially underexplored due to both limited training data and the absence of comprehensive evaluation benchmarks. This study addresses this gap by conducting a systematic assessment of prominent open-source LLMs on natural language understanding (NLU) and natural language generation (NLG) tasks in Latvian. We utilize standardized high school centralized graduation exams as a benchmark dataset, offering relatable and diverse evaluation scenarios that encompass multiple-choice questions and complex text analysis tasks. Our experimental setup involves testing models from the leading LLM families, including Llama, Qwen, Gemma, and Mistral, with OpenAI’s GPT-4 serving as a performance reference. The results reveal that certain open-source models demonstrate competitive performance in NLU tasks, narrowing the gap with GPT-4. However, all models exhibit notable deficiencies in NLG tasks, specifically in generating coherent and contextually appropriate text analyses, highlighting persistent challenges in NLG for low-resource languages. These findings contribute to efforts to develop robust multilingual benchmarks and improve LLM performance in diverse linguistic contexts.

pdf bib abs

Computational Methods for the Analysis of Complementizer Variability in Language and Literature: The Case of Hebrew “she-” and “ki”
Avi Shmidman | Aynat Rubinstein

We demonstrate a computational method for analyzing complementizer variability within language and literature, focusing on Hebrew as a test case. The primary complementizers in Hebrew are “she-” and “ki”. We first run a large-scale corpus analysis to determine the relative preference for one or the other of these complementizers given the preceding verb. On top of this foundation, we leverage clustering methods to measure the degree of interchangeability between the complementizers for each verb. The resulting tables, which provide this information for all common complement-taking verbs in Hebrew, are a first-of-its-kind lexical resource which we provide to the NLP community. Upon this foundation, we demonstrate a computational method to analyze literary works for unusual and unexpected complementizer usages deserving of literary analysis.

pdf bib abs

In corpus linguistics, registers–language varieties suited to different contexts–have traditionally been defined by their situations of use, yet recent studies reveal significant situational variation within registers. Previous quantitative studies, however, have been limited to English, leaving this variation in other languages largely unexplored. To address this gap, we apply a quantitative situational analysis to a large multilingual web register corpus, using large language models (LLMs) to annotate texts in English, Finnish, French, Swedish, and Turkish for 23 situational parameters. Using clustering techniques, we identify six situational text types, such as “Advice”, “Opinion” and “Marketing”, each characterized by distinct situational features. We explore the relationship between these text types and traditional register categories, finding partial alignment, though no register maps perfectly onto a single cluster. These results support the quantitative approach to situational analysis and are consistent with earlier findings for English. Cross-linguistic comparisons show that language accounts for only a small part of situational variation within registers, suggesting registers are situationally similar across languages. This study demonstrates the utility of LLMs in multilingual register analysis and deepens our understanding of situational variation within registers.

pdf bib abs

Testing and Adapting the Representational Abilities of Large Language Models on Folktales in Low-Resource Languages
J. A. Meaney | Beatrice Alex | William Lamb

Folktales are a rich resource of knowledge about the society and culture of a civilisation. Digital folklore research aims to use automated techniques to better understand these folktales, and it relies on abstract representations of the textual data. Although a number of large language models (LLMs) claim to be able to represent low-resource langauges such as Irish and Gaelic, we present two classification tasks to explore how useful these representations are, and three adaptations to improve the performance of these models. We find that adapting the models to work with longer sequences, and continuing pre-training on the domain of folktales improves classification performance, although these findings are tempered by the impressive performance of a baseline SVM with non-contextual features.

pdf bib abs

Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus
Craig Messner | Tom Lippincott

We present a dataset of 19th century American literary orthovariant tokens with a novel layer of human-annotated dialect group tags designed to serve as the basis for computational experiments exploring literarily meaningful orthographic variation. We perform an initial broad set of experiments over this dataset using both token (BERT) and character (CANINE)-level contextual language models. We find indications that the “dialect effect” produced by intentional orthographic variation employs multiple linguistic channels, and that these channels are able to be surfaced to varied degrees given particular language modelling assumptions. Specifically, we find evidence showing that choice of tokenization scheme meaningfully impact the type of orthographic information a model is able to surface.

pdf bib abs

Automatic extraction of geographic information, including Location Referring Expressions (LREs), can aid humanities research in analyzing large collections of historical texts. In this study, to investigate how accurate pretrained Transformer language models (LMs) can extract LREs from historical texts, we evaluate two representative types of LMs, namely, masked language model and causal language model, using early modern and contemporary Japanese datasets. Our experimental results demonstrated the potential of contemporary LMs for historical texts, but also suggest the need for further model enhancement, such as pretraining on historical texts.

pdf bib abs

Evaluating LLM Performance in Character Analysis: A Study of Artificial Beings in Recent Korean Science Fiction
Woori Jang | Seohyon Jung

Literary works present diverse and complex character behaviors, often implicit or intentionally obscured, making character analysis an inherently challenging task. This study explores LLMs’ capability to identify and interpret behaviors of artificial beings in 11 award-winning contemporary Korean science fiction short stories. Focusing on artificial beings as a distinct class of characters, rather than on conventional human characters, adds to the multi-layered complexity of analysis. We compared two LLMs, Claude 3.5 Sonnet and GPT-4o, with human experts using a custom eight-label system and a unique agreement metric developed to capture the cognitive intricacies of literary interpretation. Human inter-annotator agreement was around 50%, confirming the subjectivity of literary comprehension. LLMs differed from humans in selected text spans but demonstrated high agreement in label assignment for correctly identified spans. LLMs notably excelled at discerning ‘actions’ as semantic units rather than isolated grammatical components. This study reaffirms literary interpretation’s multifaceted nature while expanding the boundaries of NLP, contributing to discussions about AI’s capacity to understand and interpret creative works.

pdf bib abs

Text vs. Transcription: A Study of Differences Between the Writing and Speeches of U.S. Presidents
Mina Rajaei Moghadam | Mosab Rezaei | Gülşat Aygen | Reva Freedman

Even after many years of research, answering the question of the differences between spoken and written text remains open. This paper aims to study syntactic features that can serve as distinguishing factors. To do so, we focus on the transcribed speeches and written books of United States presidents. We conducted two experiments to analyze high-level syntactic features. In the first experiment, we examine these features while controlling for the effect of sentence length. In the second experiment, we compare the high-level syntactic features with low-level ones. The results indicate that adding high-level syntactic features enhances model performance, particularly in longer sentences. Moreover, the importance of the prepositional phrases in a sentence increases with sentence length. We also find that these longer sentences with more prepositional phrases are more likely to appear in speeches than in written books by U.S. presidents.

pdf bib abs

Mitigating Biases to Embrace Diversity: A Comprehensive Annotation Benchmark for Toxic Language
Xinmeng Hou

This study introduces a prescriptive annotation benchmark grounded in humanities research to ensure consistent, unbiased labeling of offensive language, particularly for casual and non-mainstream language uses. We contribute two newly annotated datasets that achieve higher inter-annotator agreement between human and language model (LLM) annotations compared to original datasets based on descriptive instructions. Our experiments show that LLMs can serve as effective alternatives when professional annotators are unavailable. Moreover, smaller models fine-tuned on multi-source LLM-annotated data outperform models trained on larger, single-source human-annotated datasets. These findings highlight the value of structured guidelines in reducing subjective variability, maintaining performance with limited data, and embracing language diversity. Content Warning: This article only analyzes offensive language for academic purposes. Discretion is advised.

pdf bib abs

Classification of Buddhist Verses: The Efficacy and Limitations of Transformer-Based Models
Nikita Neveditsin | Ambuja Salgaonkar | Pawan Lingras | Vijay Mago

This study assesses the ability of machine learning to classify verses from Buddhist texts into two categories: Therigatha and Theragatha, attributed to female and male authors, respectively. It highlights the difficulties in data preprocessing and the use of Transformer-based models on Devanagari script due to limited vocabulary, demonstrating that simple statistical models can be equally effective. The research suggests areas for future exploration, provides the dataset for further study, and acknowledges existing limitations and challenges.

pdf bib abs

Web-scale corpora present valuable research opportunities but often lack detailed metadata, making them challenging to use in linguistics and social sciences. This study tackles this problem by exploring automatic methods to classify web corpora into specific categories, focusing on text registers such as Interactive Discussion and literary genres such as Politics and Social Sciences. We train two machine learning models to classify documents from the large web-crawled OSCAR dataset: a register classifier using the multilingual, manually annotated CORE corpus, and a genre classifier using a dataset based on Kindle US&UK. Fine-tuned from XLM-R Large, the register and genre classifiers achieved F1-scores of 0.74 and 0.70, respectively. Our analysis includes evaluating the distribution of the predicted text classes and examining the intersection of genre-register pairs using topic modelling. The results show expected combinations between certain registers and genres, such as the Lyrical register often aligning with the Literature & Fiction genre. However, most registers, such as Interactive Discussion, are divided across multiple genres, like Engineering & Transportation and Politics & Social Sciences, depending on the discussion topic. This enriched metadata provides valuable insights and supports new ways of studying digital cultural heritage.

pdf bib abs

Sui Generis: Large Language Models for Authorship Attribution and Verification in Latin
Svetlana Gorovaia | Gleb Schmidt | Ivan P. Yamshchikov

This paper evaluates the performance of Large Language Models (LLMs) in authorship attribu- tion and authorship verification tasks for Latin texts of the Patristic Era. The study showcases that LLMs can be robust in zero-shot author- ship verification even on short texts without sophisticated feature engineering. Yet, the mod- els can also be easily “mislead” by semantics. The experiments also demonstrate that steering the model’s authorship analysis and decision- making is challenging, unlike what is reported in the studies dealing with high-resource mod- ern languages. Although LLMs prove to be able to beat, under certain circumstances, the traditional baselines, obtaining a nuanced and truly explainable decision requires at best a lot of experimentation.

pdf bib abs

Enhancing Neural Machine Translation for Ainu-Japanese: A Comprehensive Study on the Impact of Domain and Dialect Integration
Ryo Igarashi | So Miyagawa

Neural Machine Translation (NMT) has revolutionized language translation, yet significant challenges persist for low-resource languages, particularly those with high dialectal variation and limited standardization. This comprehensive study focuses on the Ainu language, a critically endangered indigenous language of northern Japan, which epitomizes these challenges. We address the limitations of previous research through two primary strategies: (1) extensive corpus expansion encompassing diverse domains and dialects, and (2) development of innovative methods to incorporate dialect and domain information directly into the translation process. Our approach yielded substantial improvements in translation quality, with BLEU scores increasing from 32.90 to 39.06 (+6.16) for Japanese → Ainu and from 10.45 to 31.83 (+21.38) for Ainu → Japanese. Through rigorous experimentation and analysis, we demonstrate the crucial importance of integrating linguistic variation information in NMT systems for languages characterized by high diversity and limited resources. Our findings have broad implications for improving machine translation for other low-resource languages, potentially advancing preservation and revitalization efforts for endangered languages worldwide.

pdf bib abs

Exploring Large Language Models for Qualitative Data Analysis
Tim Fischer | Chris Biemann

This paper explores the potential of Large Language Models (LLMs) to enhance qualitative data analysis (QDA) workflows within the open-source QDA platform developed at our university. We identify several opportunities within a typical QDA workflow where AI assistance can boost researcher productivity and translate these opportunities into corresponding NLP tasks: document classification, information extraction, span classification, and text generation. A benchmark tailored to these QDA activities is constructed, utilizing English and German datasets that align with relevant use cases. Focusing on efficiency and accessibility, we evaluate the performance of three prominent open-source LLMs - Llama 3.1, Gemma 2, and Mistral NeMo - on this benchmark. Our findings reveal the promise of LLM integration for streamlining QDA workflows, particularly for English-language projects. Consequently, we have implemented the LLM Assistant as an opt-in feature within our platform and report the implementation details. With this, we hope to further democratize access to AI capabilities for qualitative data analysis.

pdf bib abs

Cross-Dialectal Transfer and Zero-Shot Learning for Armenian Varieties: A Comparative Analysis of RNNs, Transformers and LLMs
Chahan Vidal-Gorène | Nadi Tomeh | Victoria Khurshudyan

This paper evaluates lemmatization, POS-tagging, and morphological analysis for four Armenian varieties: Classical Armenian, Modern Eastern Armenian, Modern Western Armenian, and the under-documented Getashen dialect. It compares traditional RNN models, multilingual models like mDeBERTa, and large language models (ChatGPT) using supervised, transfer learning, and zero/few-shot learning approaches. The study finds that RNN models are particularly strong in POS-tagging, while large language models demonstrate high adaptability, especially in handling previously unseen dialect variations. The research highlights the value of cross-variational and in-context learning for enhancing NLP performance in low-resource languages, offering crucial insights into model transferability and supporting the preservation of endangered dialects.

pdf bib abs

Increasing the Difficulty of Automatically Generated Questions via Reinforcement Learning with Synthetic Preference for Cost-Effective Cultural Heritage Dataset Generation
William Thorne | Ambrose Robinson | Bohua Peng | Chenghua Lin | Diana Maynard

As the cultural heritage sector increasingly adopts technologies like Retrieval-Augmented Generation (RAG) to provide more personalised search experiences and enable conversations with collections data, the demand for specialised evaluation datasets has grown. While end-to-end system testing is essential, it’s equally important to assess individual components. We target the final, answering task, which is well-suited to Machine Reading Comprehension (MRC). Although existing MRC datasets address general domains, they lack the specificity needed for cultural heritage information. Unfortunately, the manual creation of such datasets is prohibitively expensive for most heritage institutions. This paper presents a cost-effective approach for generating domain-specific MRC datasets with increased difficulty using Reinforcement Learning from Human Feedback (RLHF) from synthetic preference data. Our method leverages the performance of existing question-answering models on a subset of SQuAD to create a difficulty metric, assuming that more challenging questions are answered correctly less frequently. This research contributes: (1) A methodology for increasing question difficulty using PPO and synthetic data; (2) Empirical evidence of the method’s effectiveness, including human evaluation; (3) An in-depth error analysis and study of emergent phenomena; and (4) An open-source codebase and set of three llama-2-chat adapters for reproducibility and adaptation.

pdf bib abs

Assessing Large Language Models in Translating Coptic and Ancient Greek Ostraca
Audric-Charles Wannaz | So Miyagawa

The advent of Large Language Models (LLMs) substantially raised the quality and lowered the cost of Machine Translation (MT). Can scholars working with ancient languages draw benefits from this new technology? More specifically, can current MT facilitate multilingual digital papyrology? To answer this question, we evaluate 9 LLMs in the task of MT with 4 Coptic and 4 Ancient Greek ostraca into English using 6 NLP metrics. We argue that some models have already reached a performance apt to assist human experts. As can be expected from the difference in training corpus size, all models seem to perform better with Ancient Greek than with Coptic, where hallucinations are markedly more common. In the Coptic texts, the specialised Coptic Translator (CT) competes closely with Claude 3 Opus for the rank of most promising tool, while Claude 3 Opus and GPT-4o compete for the same position in the Ancient Greek texts. We argue that MT now substantially heightens the incentive to work on multilingual corpora. This could have a positive and long-lasting effect on Classics and Egyptology and help reduce the historical bias in translation availability. In closing, we reflect upon the need to meet AI-generated translations with an adequate critical stance.

pdf bib abs

The Social Lives of Literary Characters: Combining citizen science and language models to understand narrative social networks
Andrew Piper | Michael Xu | Derek Ruths

Characters and their interactions are central to the fabric of narratives, playing a crucial role in developing readers’ social cognition. In this paper, we introduce a novel annotation framework that distinguishes between five types of character interactions, including bilateral and unilateral classifications. Leveraging the crowd-sourcing framework of citizen science, we collect a large dataset of manual annotations (N=13,395). Using this data, we explore how genre and audience factors influence social network structures in a sample of contemporary books. Our findings demonstrate that fictional narratives tend to favor more embodied interactions and exhibit denser and less modular social networks. Our work not only enhances the understanding of narrative social networks but also showcases the potential of integrating citizen science with NLP methodologies for large-scale narrative analysis.

pdf bib abs

Multi-word expressions in biomedical abstracts and their plain English adaptations
Sergei Bagdasarov | Elke Teich

This study analyzes the use of multi-word expressions (MWEs), prefabricated sequences of words (e.g. in this case, this means that, healthcare service, follow up) in biomedical abstracts and their plain language adaptations. While English academic writing became highly specialized and complex from the late 19th century onwards, recent decades have seen a rising demand for a lay-friendly language in scientific content, especially in the health domain, to bridge a communication gap between experts and laypersons. Based on previous research showing that MWEs are easier to process than non-formulaic word sequences of comparable length, we hypothesize that they can potentially be used to create a more reader-friendly language. Our preliminary results suggest some significant differences between complex and plain abstracts when it comes to the usage patterns and informational load of MWEs.

pdf bib abs

Assessing the Performance of ChatGPT-4, Fine-tuned BERT and Traditional ML Models on Moroccan Arabic Sentiment Analysis
Mohamed Hannani | Abdelhadi Soudi | Kristof Van Laerhoven

Large Language Models (LLMs) have demonstrated impressive capabilities in various natural language processing tasks across different languages. However, their performance in low-resource languages and dialects, such as Moroccan Arabic (MA), requires further investigation. This study evaluates the performance of ChatGPT-4, different fine-tuned BERT models, FastText as text representation, and traditional machine learning models on MA sentiment analysis. Experiments were done on two open source MA datasets: an X(Twitter) Moroccan Arabic corpus (MAC) and a Moroccan Arabic YouTube corpus (MYC) datasets to assess their capabilities on sentiment text classification. We compare the performance of fully fine-tuned and pre-trained Arabic BERT-based models with ChatGPT-4 in zero-shot settings.

pdf bib abs

Analyzing Pokémon and Mario Streamers’ Twitch Chat with LLM-based User Embeddings
Mika Hämäläinen | Jack Rueter | Khalid Alnajjar

We present a novel digital humanities method for representing our Twitch chatters as user embeddings created by a large language model (LLM). We cluster these embeddings automatically using affinity propagation and further narrow this clustering down through manual analysis. We analyze the chat of one stream by each Twitch streamer: SmallAnt, DougDoug and PointCrow. Our findings suggest that each streamer has their own type of chatters, however two categories emerge for all of the streamers: supportive viewers and emoji and reaction senders. Repetitive message spammers is a shared chatter category for two of the streamers.

pdf bib abs

Corpus Development Based on Conflict Structures in the Security Field and LLM Bias Verification
Keito Inoshita

This study investigates the presence of biases in large language models (LLMs), specifically focusing on how these models process and reflect inter-state conflict structures. Previous research has often lacked the standardized datasets necessary for a thorough and consistent evaluation of biases in this context. Without such datasets, it is challenging to accurately assess the impact of these biases on critical applications. To address this gap, we developed a diverse and high-quality corpus using a four-phase process. This process included generating texts based on international conflict-related keywords, enhancing emotional diversity to capture a broad spectrum of sentiments, validating the coherence and connections between texts, and conducting final quality assurance through human reviewers who are experts in natural language processing. Our analysis, conducted using this newly developed corpus, revealed subtle but significant negative biases in LLMs, particularly towards Eastern bloc countries such as Russia and China. These biases have the potential to influence decision-making processes in fields like national security and international relations, where accurate, unbiased information is crucial. The findings underscore the importance of evaluating and mitigating these biases to ensure the reliability and fairness of LLMs when applied in sensitive areas.

pdf bib abs

Generating Interpretations of Policy Announcements
Andreas Marfurt | Ashley Thornton | David Sylvan | James Henderson

Recent advances in language modeling have focused on (potentially multiple-choice) question answering, open-ended generation, or math and coding problems. We look at a more nuanced task: the interpretation of statements of political actors. To this end, we present a dataset of policy announcements and corresponding annotated interpretations, on the topic of US foreign policy relations with Russia in the years 1993 up to 2016. We analyze the performance of finetuning standard sequence-to-sequence models of varying sizes on predicting the annotated interpretations and compare them to few-shot prompted large language models. We find that 1) model size is not the main factor for success on this task, 2) finetuning smaller models provides both quantitatively and qualitatively superior results to in-context learning with large language models, but 3) large language models pick up the annotation format and approximate the category distribution with just a few in-context examples.

pdf bib abs

Order Up! Micromanaging Inconsistencies in ChatGPT-4o Text Analyses
Erkki Mervaala | Ilona Kousa

Large language model (LLM) applications have taken the world by storm in the past two years, and the academic sphere has not been an exception. One common, cumbersome task for researchers to attempt to automatise has been text annotation and, to an extent, analysis. Popular LLMs such as ChatGPT have been examined as a research assistant and as an analysis tool, and several discrepancies regarding both transparency and the generative content have been uncovered. Our research approaches the usability and trustworthiness of ChatGPT for text analysis from the point of view of an “out-of-the-box” zero-shot or few-shot setting, focusing on how the context window and mixed text types affect the analyses generated. Results from our testing indicate that both the types of the texts and the ordering of different kinds of texts do affect the ChatGPT analysis, but also that the context-building is less likely to cause analysis deterioration when analysing similar texts. Though some of these issues are at the core of how LLMs function, many of these caveats can be addressed by transparent research planning.

pdf bib abs

CIPHE: A Framework for Document Cluster Interpretation and Precision from Human Exploration
Anton Eklund | Mona Forsman | Frank Drewes

Document clustering models serve unique application purposes, which turns model quality into a property that depends on the needs of the individual investigator. We propose a framework, Cluster Interpretation and Precision from Human Exploration (CIPHE), for collecting and quantifying human interpretations of cluster samples. CIPHE tasks survey participants to explore actual document texts from cluster samples and records their perceptions. It also includes a novel inclusion task that is used to calculate the cluster precision in an indirect manner. A case study on news clusters shows that CIPHE reveals which clusters have multiple interpretation angles, aiding the investigator in their exploration.

pdf bib abs

Empowering Teachers with Usability-Oriented LLM-Based Tools for Digital Pedagogy
Melany Vanessa Macias | Lev Kharlashkin | Leo Einari Huovinen | Mika Hämäläinen

We present our work on two LLM-based tools that utilize artificial intelligence and creative technology to improve education. The first tool is a Moodle AI plugin, which helps teachers manage their course content more efficiently using AI-driven analysis, content generation, and an interactive chatbot. The second one is a curriculum planning tool that provides insight into the sustainability, work-life relevance, and workload of each course. Both of these tools have the common goal of integrating sustainable development goals (UN SDGs) into teaching, among other things. We will describe the usability-focused and user-centric approach we have embraced when developing these tools.