Proceedings of the First Workshop on Holocaust Testimonies as Language Resources (HTRes) @ LREC-COLING 2024

Isuri Anuradha, Martin Wynne, Francesca Frontini, Alistair Plum (Editors)

Anthology ID: 2024.htres-1
Month: May
Year: 2024
Address: Torino, Italia
Venues: htres | WS
Events: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) | Workshop on Holocaust Testimonies as Language Resources (2024) | Other Workshops and Events (2024)
Publisher: ELRA and ICCL
URL: https://aclanthology.org/2024.htres-1/
PDF: https://aclanthology.org/2024.htres-1.pdf

The Impact of Digital Editing on the Study of Holocaust Survivors’ Testimonies in the context of Voci dall’Inferno Project
Angelo Mario Del Grosso | Marina Riccucci | Elvira Mercatanti

In Nazi concentration camps, approximately 20 million people perished. This included young and old, men and women, Jews, dissidents, and homosexuals. Only 10% of those deported survived. This paper introduces the “Voci dall’Inferno” project, which has two key objectives: a) to create a comprehensive digital archive by encoding a corpus of non-literary testimonies, including both written and oral sources; and b) to analyze the use of Dante’s language by identifying the presence of Dante’s lexicon and allusions. Currently, the project holds 47 testimonies, with 29 transcribed in full text and 18 encoded using the XML-TEI format. The project is driven by a multidisciplinary and educational context involving experts in the humanities and computer science. Its findings will be disseminated through a user-friendly web application built on an XML foundation. Though currently in its prototyping phase, the application offers several features, including a search engine for testimonies, terms, or phrases within the corpus. Additionally, a browsing interface allows users to read and listen to the original testimonies, while a visualization tool enables deeper exploration of the corpus’s content. By adhering to the Text Encoding Initiative (TEI) guidelines, the project ensures a structured digital archive aligned with the FAIR principles for data accessibility and reusability.

TEI Specifications for a Sustainable Management of Digitized Holocaust Testimonies
Sarah Bénière | Floriane Chiffoleau | Laurent Romary

Data modeling and standardization are central issues in the field of Digital Humanities, and all the more so when dealing with Holocaust testimonies, where stable preservation and long-term accessibility are key. The EHRI Online Editions are composed of documents of diverse nature (testimonies, letters, diplomatic reports, etc.), held by EHRI’s partnering institutions, and selected, gathered thematically and encoded according to the TEI Guidelines by the editors within the EHRI Consortium. Standardization is essential in order to make sure that the editions are consistent with one another. The issue of consistency also encourages a broader reflection on the usage of standards when processing data, and on the standardization of digital scholarly editions of textual documents in general. In this paper, we present the normalization work we carried out on the EHRI Online Editions. It includes a customization of the TEI adapted to Holocaust-related documents, and a focus on the implementation of controlled vocabulary. We recommend the use of these encoding specifications as a tool for researchers and/or non-TEI experts to ensure their encoding is valid and consistent across editions, but also as a mechanism for integrating the edition work smoothly within a wider workflow leading from image digitization to publication.

Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools
Maria Dermentzi | Hugo Scheithauer

The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. Creating a tool capable of detecting named entities in texts such as Holocaust testimonies or archival descriptions would make it easier to link more material with relevant identifiers in domain-specific controlled vocabularies, semantically enriching it, and making it more discoverable. With this paper, we release EHRI-NER, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for Named Entity Recognition (NER) in Holocaust-related texts. EHRI-NER is built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. We leverage this dataset to fine-tune the multilingual Transformer-based language model XLM-RoBERTa (XLM-R) to determine whether a single model can be trained to recognize entities across different document types and languages. The results of our experiments show that despite our relatively small dataset, in a multilingual experiment setup, the overall F1 score achieved by XLM-R fine-tuned on multilingual annotations is 81.5%. We argue that this score is sufficiently high to consider the next steps towards deploying this model.
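
The conversion step mentioned in this abstract (annotated documents to a format suitable for training NER models) can be illustrated with a minimal sketch. It assumes character-offset entity spans and a BIO tagging scheme, which are common choices but are not confirmed by the abstract as the authors' actual format. The function names `tokenize` and `spans_to_bio`, the whitespace tokenizer, the labels `PER` and `LOC`, and the example sentence are all hypothetical.

```python
import re
from typing import List, Tuple

Token = Tuple[str, int, int]   # (text, start offset, end offset)
Span = Tuple[int, int, str]    # (start offset, end offset, entity label)


def tokenize(text: str) -> List[Token]:
    """Split on whitespace and keep character offsets (illustrative tokenizer)."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def spans_to_bio(tokens: List[Token], spans: List[Span]) -> List[str]:
    """Assign a BIO tag to every token, given annotated character spans."""
    tags = ["O"] * len(tokens)
    for span_start, span_end, label in spans:
        inside = False
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token belongs to the entity if it lies fully inside the span.
            if tok_start >= span_start and tok_end <= span_end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags


# Hypothetical example sentence and annotations.
text = "Anna Meier lived in Vienna"
annotations = [(0, 10, "PER"), (20, 26, "LOC")]
tags = spans_to_bio(tokenize(text), annotations)
```

Token-level tags of this kind are what token-classification fine-tuning typically consumes; a subword model such as XLM-R would additionally need the tags aligned to its own tokenization, which this sketch does not cover.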

Dates and places as points of attachment for memorial contents in the ISW corpus: 1938 as a turning point
Carolina Flinz | Simona Leonardi

The aim of this paper is the identification and subsequent analysis of crisis years in the narrative biographical interviews with German-speaking Jews from the corpus ISW (Emigrantendeutsch in Israel: Wiener in Jerusalem / Migrant German in Israel: Viennese in Jerusalem). The possible “chronological landmarks” within a year are also addressed, investigating how a certain year, 1938, represents a turning point in the narrators’ life stories, as it clusters most of the traumatic events linked to the Shoah. The transcripts were analysed using the tool Sketch Engine. The study alternates corpus-driven and corpus-based steps, uses a quantitative-qualitative approach (see Lemnitzer and Zinsmeister, 2015), and also integrates approaches from narrative analysis. The research questions that guide our investigation are as follows: Are there any special dates that recur as chronological landmarks of crisis situations (Leonardi 2023a)? Which are they? Do they recur in connection with special places? Which ones?

Creating a Typology of Places to Annotate Holocaust Testimonies Through Machine Learning
Christine Liu | William J.B. Mattingly

The Holocaust was not only experienced in iconic places like Auschwitz or the Warsaw ghetto. Ordinary places, such as city streets, forests, hills, and homes, were transformed by occupation and systematic violence. While most of these places are unnamed and locationally ambiguous, their omnipresence throughout post-war testimonies from witnesses and survivors of the Holocaust emphasizes their undeniable importance. This paper shares a methodology for developing a typology of places in order to annotate both named and unnamed places within interview transcripts from the United States Holocaust Memorial Museum (USHMM) through a machine learning model. The approach underscores the benefits of hybrid analysis through both automated extraction and manual review to create distinct categories of places. This paper also reviews how testimony transcripts were converted into structured data for annotation and previews ongoing work to design a search engine for users to dynamically query this place-based approach to studying the Holocaust.

Speech Technology Services for Oral History Research
Christoph Draxler | Henk van den Heuvel | Arjan van Hessen | Pavel Ircing | Jan Lehečka

Oral history is about the oral sources of witnesses to, and commentators on, historical events. Speech technology is an important instrument for processing such recordings in order to obtain transcriptions and further enhancements that structure the oral account. In this contribution we address the transcription portal and the web services associated with speech processing at BAS, speech solutions developed at LINDAT, how to do it yourself with Whisper, remaining challenges, and future developments.

Identifying Narrative Patterns and Outliers in Holocaust Testimonies Using Topic Modeling
Maxim Ifergan | Omri Abend | Renana Keydar | Amit Pinchevski

The vast collection of Holocaust survivor testimonies presents invaluable historical insights but poses challenges for manual analysis. This paper leverages advanced Natural Language Processing (NLP) techniques to explore the USC Shoah Foundation Holocaust testimony corpus. By treating testimonies as structured question-and-answer sections, we apply topic modeling to identify key themes. We experiment with BERTopic, which leverages recent advances in language modeling technology. We align testimony sections into fixed parts, revealing the evolution of topics across the corpus of testimonies. This highlights both a common narrative schema and divergences between subgroups based on age and gender. We introduce a novel method to identify testimonies within groups that exhibit atypical topic distributions resembling those of other groups. This study offers unique insights into the complex narratives of Holocaust survivors, demonstrating the power of NLP to illuminate historical discourse and identify potential deviations in survivor experiences.
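
The idea of flagging testimonies whose topic distribution resembles another group's can be sketched as follows. This is not the authors' method, which the abstract does not specify. It is a simple stand-in that assumes each testimony already has a topic distribution (a list of probabilities) and a group label, and that compares each testimony to per-group mean distributions using Jensen-Shannon divergence. All names and the toy data are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def js_divergence(p: List[float], q: List[float]) -> float:
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: List[float], y: List[float]) -> float:
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def group_means(dists: List[List[float]], groups: List[str]) -> Dict[str, List[float]]:
    """Average the topic distributions of all testimonies in each group."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for dist, group in zip(dists, groups):
        if group not in sums:
            sums[group] = [0.0] * len(dist)
        sums[group] = [s + x for s, x in zip(sums[group], dist)]
        counts[group] += 1
    return {g: [s / counts[g] for s in total] for g, total in sums.items()}


def cross_group_outliers(dists: List[List[float]], groups: List[str]) -> List[Tuple[int, str, str]]:
    """Return (index, own group, nearest group) where the nearest mean is another group's."""
    means = group_means(dists, groups)
    flagged = []
    for i, (dist, group) in enumerate(zip(dists, groups)):
        nearest = min(means, key=lambda g: js_divergence(dist, means[g]))
        if nearest != group:
            flagged.append((i, group, nearest))
    return flagged


# Toy data: three topics, two hypothetical groups "A" and "B".
toy_dists = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.2, 0.1, 0.7], [0.1, 0.2, 0.7]]
toy_groups = ["A", "A", "B", "B", "A"]
flagged = cross_group_outliers(toy_dists, toy_groups)
```

In the toy data, the last testimony is labeled "A" but its distribution sits closer to the mean of group "B", so it is the only one flagged. A real analysis would likely exclude each testimony from its own group mean and test whether the difference is significant.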

Tracing the deportation to define Holocaust geometries. The exploratory case of Milan
Giovanni Pietro Vitali | Laura Brazzo

This paper presents a pilot project conducted in collaboration with the Fondazione CDEC to shed light on the historical dynamics of the arrests and deportations of Jews from Italy to foreign concentration camps between 1943 and 1945. Led by a multidisciplinary team, including a Digital Humanities expert, an archivist, a GIS developer, and an education manager, the project aimed to rework archival information into data visualisation models, utilising a subset of data from the CDEC LOD dataset of the victims of the Holocaust in Italy to construct detailed visual representations of deportation routes. Drawing inspiration from previous projects like the Atlas of Nazi-Fascist Massacres and research on Holocaust testimonies, this project sought to create interactive maps, networks, and graphs illustrating the paths of forced transfers endured by arrested Jews, particularly focusing on those born or arrested in Milan. Despite challenges such as incomplete or imprecise data, the team managed to reconstruct deportation routes and classify transport convoys, enhancing the understanding of this dark period in history. The visualisations, along with detailed repositories and links provided on GitHub, serve as valuable research tools for both scholarly and educational purposes, offering users varying levels of granularity to explore historical events and timelines. Through meticulous data analysis and visualisation techniques, this project contributes to ongoing efforts to preserve and understand the tragic events of the Holocaust, emphasizing the importance of archival work and interdisciplinary collaboration in historical research.

Zero-shot Trajectory Mapping in Holocaust Testimonies
Eitan Wagner | Renana Keydar | Omri Abend

This work presents the task of Zero-shot Trajectory Mapping, which focuses on the spatial dimension of narratives. The task consists of two parts: (1) creating a “map” with all the locations mentioned in a set of texts, and (2) extracting a trajectory from a single testimony and positioning it within the map. Following recent advances in the context length capabilities of large language models, we propose a pipeline that performs this task in a completely unsupervised manner, without requiring labels of any type. We demonstrate the pipeline on a set of ≈ 75 testimonies and present the resulting map and samples of the trajectories. We conclude that current long-range models succeed in generating meaningful maps and trajectories. Beyond visualization and indexing, we propose future directions for adapting the task as a step towards dividing testimony sets into clusters and aligning parallel parts of different testimonies.