Daniel Hershcovich - ACL Anthology

2025

Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users
Antonia Karamolegkou | Malvina Nikandrou | Georgios Pantazopoulos | Danae Sanchez Villegas | Phillip Rust | Ruchira Dhar | Daniel Hershcovich | Anders Søgaard
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper explores the effectiveness of Multimodal Large Language Models (MLLMs) as assistive technologies for visually impaired individuals. We conduct a user survey to identify adoption patterns and key challenges users face with such technologies. Despite a high adoption rate of these models, our findings highlight concerns related to contextual understanding, cultural sensitivity, and complex scene understanding, particularly for individuals who may rely solely on them for visual interpretation. Informed by these results, we collate five user-centred tasks with image and video inputs, including a novel task on Optical Braille Recognition. Our systematic evaluation of twelve MLLMs reveals that further advancements are necessary to overcome limitations related to cultural context, multilingual support, Braille reading comprehension, assistive object recognition, and hallucinations. This work provides critical insights into the future direction of multimodal AI for accessibility, underscoring the need for more inclusive, robust, and trustworthy visual assistance technologies.

Proceedings of the 3rd Workshop on Cross-Cultural Considerations in NLP (C3NLP 2025)
Vinodkumar Prabhakaran | Sunipa Dev | Luciana Benotti | Daniel Hershcovich | Yong Cao | Li Zhou | Laura Cabello | Ife Adebara
Proceedings of the 3rd Workshop on Cross-Cultural Considerations in NLP (C3NLP 2025)

Dying or Departing? Euphemism Detection for Death Discourse in Historical Texts
Ali Al-Laith | Alexander Conroy | Jens Bjerring-Hansen | Bolette Pedersen | Carsten Levisen | Daniel Hershcovich
Proceedings of the 31st International Conference on Computational Linguistics

Euphemisms are a linguistic device used to soften discussions of sensitive or uncomfortable topics, with death being a prominent example. In this paper, we present a study on the detection of death-related euphemisms in historical literary texts from a corpus containing Danish and Norwegian novels from the late 19th century. We introduce an annotated dataset of euphemistic and literal references to death, including both common and rare euphemisms, ranging from well-established terms to more culturally nuanced expressions. We evaluate the performances of state-of-the-art pre-trained language models fine-tuned for euphemism detection. Our findings show that fixed, literal expressions of death became less frequent over time, while metaphorical euphemisms grew in prevalence. Additionally, euphemistic language was more common in historical novels, whereas contemporary novels tended to refer to death more literally, reflecting the rise of secularism. These results shed light on the shifting discourse on death during a period when the concept of death as final became prominent.

Beyond Demographics: Enhancing Cultural Value Survey Simulation with Multi-Stage Personality-Driven Cognitive Reasoning
Haijiang Liu | Qiyuan Li | Chao Gao | Yong Cao | Xiangyu Xu | Xun Wu | Daniel Hershcovich | Jinguang Gu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Introducing **MARK**, the **M**ulti-st**A**ge **R**easoning framewor**K** for cultural value survey response simulation, designed to enhance the accuracy, steerability, and interpretability of large language models in this task. The system is inspired by the type dynamics theory in the MBTI psychological framework for personality research. It effectively predicts and utilizes human demographic information for simulation: life-situational stress analysis, group-level personality prediction, and self-weighted cognitive imitation. Experiments on the World Values Survey show that MARK outperforms existing baselines by 10% accuracy and reduces the divergence between model predictions and human preferences. This highlights the potential of our framework to improve zero-shot personalization and help social scientists interpret model predictions.

Beyond Words: Exploring Cultural Value Sensitivity in Multimodal Models
Srishti Yadav | Zhi Zhang | Daniel Hershcovich | Ekaterina Shutova
Findings of the Association for Computational Linguistics: NAACL 2025

Investigating value alignment in Large Language Models (LLMs) based on cultural context has become a critical area of research. However, similar biases have not been extensively explored in large vision-language models (VLMs). As the scale of multimodal models continues to grow, it becomes increasingly important to assess whether images can serve as reliable proxies for culture and how these values are embedded through the integration of both visual and textual data. In this paper, we conduct a thorough evaluation of multimodal models at different scales, focusing on their alignment with cultural values. Our findings reveal that, much like LLMs, VLMs exhibit sensitivity to cultural values, but their performance in aligning with these values is highly context-dependent. While VLMs show potential in improving value understanding through the use of images, this alignment varies significantly across contexts, highlighting the complexities and underexplored challenges in the alignment of multimodal models.

Do LLMs Understand Wine Descriptors Across Cultures? A Benchmark for Cultural Adaptations of Wine Reviews
Chenye Zou | Xingyue Wen | Tianyi Hu | Qian Janice Wang | Daniel Hershcovich
Findings of the Association for Computational Linguistics: EMNLP 2025

Recent advances in large language models (LLMs) have opened the door to culture-aware language tasks. We introduce the novel problem of adapting wine reviews across Chinese and English, which goes beyond literal translation by incorporating regional taste preferences and culture-specific flavor descriptors. In a case study on cross-cultural wine review adaptation, we compile the first parallel corpus of professional reviews, containing 8k Chinese and 16k Anglophone reviews. We benchmark both neural-machine-translation baselines and state-of-the-art LLMs with automatic metrics and human evaluation. For the latter, we propose three culture-oriented criteria—Cultural Proximity, Cultural Neutrality, and Cultural Genuineness—to assess how naturally a translated review resonates with target-culture readers. Our analysis shows that current models struggle to capture cultural nuances, especially in translating wine descriptions across different cultures. This highlights the challenges and limitations of translation models in handling cultural content.

HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals
Guimin Hu | Daniel Hershcovich | Hasti Seifi
Findings of the Association for Computational Linguistics: EMNLP 2025

Haptic signals, from smartphone vibrations to virtual reality touch feedback, can effectively convey information and enhance realism, but designing signals that resonate meaningfully with users is challenging. To facilitate this, we introduce a multimodal dataset and task of matching user descriptions to vibration haptic signals, and highlight two primary challenges: (1) a lack of large haptic vibration datasets annotated with textual descriptions, as collecting haptic descriptions is time-consuming, and (2) the limited capability of existing tasks and models to describe vibration signals in text. To advance this area, we create HapticCap, the first fully human-annotated haptic-captioned dataset, containing 92,070 haptic-text pairs for user descriptions of sensory, emotional, and associative attributes of vibrations. Based on HapticCap, we propose the haptic-caption retrieval task and present the results of this task from a supervised contrastive learning framework that brings together text representations within specific categories and vibrations. Overall, the combination of language model T5 and audio model AST yields the best performance in the haptic-caption retrieval task, especially when separately trained for each description category. The dataset is available at https://huggingface.co/datasets/GuiminHu/HapticCap.

Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations
Yong Cao | Haijiang Liu | Arnav Arora | Isabelle Augenstein | Paul Röttger | Daniel Hershcovich
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large-scale surveys are essential tools for informing social science research and policy, but running surveys is costly and time-intensive. If we could accurately simulate group-level survey results, this would therefore be very valuable to social science research. Prior work has explored the use of large language models (LLMs) for simulating human behaviors, mostly through prompting. In this paper, we are the first to specialize LLMs for the task of simulating survey response distributions. As a testbed, we use country-level results from two global cultural surveys. We devise a fine-tuning method based on first-token probabilities to minimize divergence between predicted and actual response distributions for a given question. Then, we show that this method substantially outperforms other methods and zero-shot classifiers, even on unseen questions, countries, and a completely unseen survey. While even our best models struggle with the task, especially on unseen questions, our results demonstrate the benefits of specialization for simulation, which may accelerate progress towards sufficiently accurate simulation in the future.
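As an illustrative aside, the quantity this abstract describes (a divergence between a predicted and an actual response distribution, where the prediction comes from first-token probabilities) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not name the divergence, so Jensen-Shannon divergence is used here purely as an example, and the function names, logits, and survey shares are all hypothetical.

```python
import math

def first_token_distribution(option_logits):
    """Softmax over the logits of the first token of each answer option.

    Assumption: one logit per answer option (e.g. for the tokens "1".."4").
    """
    m = max(option_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in option_logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same options."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_divergence(p, q):
    """Symmetric divergence, bounded above by ln(2) with natural logs."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mix) + 0.5 * kl_divergence(q, mix)

# Hypothetical numbers: observed answer shares for one question in one country,
# and made-up model logits for the four answer-option tokens.
survey = [0.10, 0.20, 0.40, 0.30]
logits = [0.0, 0.5, 1.5, 1.0]

predicted = first_token_distribution(logits)
print(jensen_shannon_divergence(survey, predicted))
```

A fine-tuning objective in this spirit would lower such a divergence across many (question, country) pairs; the sketch only shows how a single pair could be scored.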

Does Mapo Tofu Contain Coffee? Probing LLMs for Food-related Cultural Knowledge
Li Zhou | Taelin Karidi | Wanlong Liu | Nicolas Garneau | Yong Cao | Wenyu Chen | Haizhou Li | Daniel Hershcovich
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Recent studies have highlighted the presence of cultural biases in Large Language Models (LLMs), yet often lack a robust methodology to dissect these phenomena comprehensively. Our work aims to bridge this gap by delving into the Food domain—a universally relevant yet culturally diverse aspect of human life. We introduce FmLAMA, a multilingual dataset centered on food-related cultural facts and variations in food practices. We analyze LLMs across various architectures and configurations, evaluating their performance in both monolingual and multilingual settings. By leveraging templates in six different languages, we investigate how LLMs interact with language-specific and cultural knowledge. Our findings reveal that (1) LLMs demonstrate a pronounced bias towards food knowledge prevalent in the United States; (2) Incorporating relevant cultural context significantly improves LLMs’ ability to access cultural knowledge; (3) The efficacy of LLMs in capturing cultural nuances is highly dependent on the interplay between the probing language, the specific model architecture, and the cultural context in question. This research underscores the complexity of integrating cultural understanding into LLMs and emphasizes the importance of culturally diverse datasets to mitigate biases and enhance model performance across different cultural domains.

Annotating and Classifying Direct Speech in Historical Danish and Norwegian Literary Texts
Ali Al-Laith | Alexander Conroy | Kirstine Nielsen Degn | Jens Bjerring-Hansen | Daniel Hershcovich
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

Analyzing direct speech in historical literary texts provides insights into character dynamics, narrative style, and discourse patterns. In late 19th-century Danish and Norwegian fiction, direct speech reflects characters’ social and geographical backgrounds. However, inconsistent typographic conventions in Scandinavian literature complicate computational methods for distinguishing direct speech from other narrative elements. To address this, we introduce an annotated dataset from the MeMo corpus, capturing speech markers and tags in Danish and Norwegian novels. We evaluate pre-trained language models for classifying direct speech, with results showing that a Danish Foundation Model (DFM), trained on extensive Danish data, has the highest performance. Finally, we conduct a classifier-assisted quantitative corpus analysis and find a downward trend in the prevalence of speech over time.

2024

Automated Sentence Generation for a Spaced Repetition Software
Benjamin Paddags | Daniel Hershcovich | Valkyrie Savage
Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024)

This paper presents and tests AllAI, an app that utilizes state-of-the-art NLP technology to assist second language acquisition through a novel method of sentence-based spaced repetition. Diverging from current single word or fixed sentence repetition, AllAI dynamically combines words due for repetition into sentences, enabling learning words in context while scheduling them independently. This research explores various suitable NLP paradigms and finds a few-shot prompting approach and retrieval of existing sentences from a corpus to yield the best correctness and scheduling accuracy. Subsequently, it evaluates these methods on 26 learners of Danish, finding a four-fold increase in the speed at which new words are learned, compared to conventional spaced repetition. Users of the retrieval method also reported significantly higher enjoyment, hinting at a higher user engagement.

Proceedings of the 2nd Workshop on Cross-Cultural Considerations in NLP
Vinodkumar Prabhakaran | Sunipa Dev | Luciana Benotti | Daniel Hershcovich | Laura Cabello | Yong Cao | Ife Adebara | Li Zhou
Proceedings of the 2nd Workshop on Cross-Cultural Considerations in NLP

Geo-Encoder: A Chunk-Argument Bi-Encoder Framework for Chinese Geographic Re-Ranking
Yong Cao | Ruixue Ding | Boli Chen | Xianzhi Li | Min Chen | Daniel Hershcovich | Pengjun Xie | Fei Huang
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

The Chinese geographic re-ranking task aims to find the most relevant addresses among retrieved candidates, which is crucial for location-related services such as navigation maps. Unlike general sentences, Chinese geographic contexts are closely intertwined with geographical concepts, from general spans (e.g., province) to specific spans (e.g., road). Given this feature, we propose an innovative framework, namely Geo-Encoder, to more effectively integrate Chinese geographical semantics into re-ranking pipelines. Our methodology begins by employing off-the-shelf tools to associate text with geographical spans, treating them as chunking units. Then, we present a multi-task learning module to simultaneously acquire an effective attention matrix that determines chunk contributions to geographic representations. Furthermore, we put forth an asynchronous update mechanism for the proposed task, aiming to guide the model to focus on specific chunks. Experiments on two Chinese benchmark datasets show that the Geo-Encoder achieves significant improvements when compared to state-of-the-art baselines. Notably, it leads to a substantial improvement in the Hit@1 score of MGEO-BERT, increasing it by 6.22 points, from 62.76 to 68.98, on the GeoTES dataset.
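For readers unfamiliar with the metric, Hit@k is commonly computed as the share of queries whose correct answer appears among the top k ranked candidates. The sketch below is a generic illustration, not the paper's evaluation code; the function name and toy rankings are hypothetical. It also checks the arithmetic of the reported scores: moving from 62.76 to 68.98 is an absolute gain of 6.22 points, which corresponds to a relative gain of roughly 9.9%.

```python
def hit_at_k(ranked_candidates, gold, k=1):
    """Percentage of queries whose gold answer is within the top-k candidates."""
    hits = sum(1 for ranking, answer in zip(ranked_candidates, gold)
               if answer in ranking[:k])
    return 100.0 * hits / len(gold)

# Toy data (hypothetical): four queries, each with three ranked candidates.
rankings = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "a", "b"],
    ["a", "c", "b"],
]
gold = ["a", "a", "c", "b"]

print(hit_at_k(rankings, gold, k=1))  # 50.0
print(hit_at_k(rankings, gold, k=2))  # 75.0

# Arithmetic on the scores reported in the abstract.
before, after = 62.76, 68.98
absolute_gain = round(after - before, 2)                    # 6.22 points
relative_gain = round(100 * (after - before) / before, 1)   # 9.9 percent
print(absolute_gain, relative_gain)
```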

Bridging Cultures in the Kitchen: A Framework and Benchmark for Cross-Cultural Recipe Retrieval
Tianyi Hu | Maria Maistro | Daniel Hershcovich
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The cross-cultural adaptation of recipes is an important application of identifying and bridging cultural differences in language. The challenge lies in retaining the essence of the original recipe while also aligning with the writing and dietary habits of the target culture. Information Retrieval (IR) offers a way to address the challenge because it retrieves results from the culinary practices of the target culture while maintaining relevance to the original recipe. We introduce the novel task of cross-cultural recipe retrieval and present a unique Chinese-English cross-cultural recipe retrieval benchmark. Our benchmark is manually annotated under limited resources, utilizing various retrieval models to generate a pool of candidate results for manual annotation. The dataset provides retrieval samples that are culturally adapted but textually diverse, presenting greater challenges. We propose CARROT, a plug-and-play cultural-aware recipe information retrieval framework that incorporates cultural-aware query rewriting and re-ranking methods, and evaluate it on both our benchmark and intuitive human judgments. The results show that our framework significantly enhances the preservation of the original recipe and its cultural appropriateness for the target culture. We believe these insights will significantly contribute to future research on cultural adaptation.

Noise, Novels, Numbers. A Framework for Detecting and Categorizing Noise in Danish and Norwegian Literature
Ali Al-Laith | Daniel Hershcovich | Jens Bjerring-Hansen | Jakob Ingemann Parby | Alexander Conroy | Timothy R Tangherlini
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

We present a framework for detecting and categorizing noise in literary texts, demonstrated through its application to Danish and Norwegian literature from the late 19th century. Noise, understood as “aberrant sonic behaviour,” is not only an auditory phenomenon but also a cultural construct tied to the processes of civilization and urbanization. We begin by utilizing topic modeling techniques to identify noise-related documents, followed by fine-tuning BERT-based language models trained on Danish and Norwegian texts to analyze a corpus of over 800 novels. We identify and track the prevalence of noise in these texts, offering insights into the literary perceptions of noise during the Scandinavian “Modern Breakthrough” period (1870-1899). Our contributions include the development of a comprehensive dataset annotated for noise-related segments and their categorization into human-made, non-human-made, and musical noises. This study illustrates the framework’s potential for enhancing the understanding of the relationship between noise and its literary representations, providing a deeper appreciation of the auditory elements in literary works, including as sources for cultural history.

FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture
Wenyan Li | Crystina Zhang | Jiaang Li | Qiwei Peng | Raphael Tang | Li Zhou | Weijia Zhang | Guimin Hu | Yifei Yuan | Anders Søgaard | Daniel Hershcovich | Desmond Elliott
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Food is a rich and varied dimension of cultural heritage, crucial to both individuals and social groups. To bridge the gap in the literature on the often-overlooked regional diversity in this domain, we introduce FoodieQA, a manually curated, fine-grained image-text dataset capturing the intricate features of food cultures across various regions in China. We evaluate vision-language models (VLMs) and large language models (LLMs) on newly collected, unseen food images and corresponding questions. FoodieQA comprises three multiple-choice question-answering tasks where models need to answer questions based on multiple images, a single image, and text-only descriptions, respectively. While LLMs excel at text-based question answering, surpassing human accuracy, the open-sourced VLMs still fall short by 41% on multi-image and 21% on single-image VQA tasks, although closed-weights models perform closer to human levels (within 10%). Our findings highlight that understanding food and its cultural implications remains a challenging and under-explored direction.

Bridging Cultural Nuances in Dialogue Agents through Cultural Value Surveys
Yong Cao | Min Chen | Daniel Hershcovich
Findings of the Association for Computational Linguistics: EACL 2024

The cultural landscape of interactions with dialogue agents is a compelling yet relatively unexplored territory. It’s clear that various sociocultural aspects, from communication styles and beliefs to shared metaphors and knowledge, profoundly impact these interactions. To delve deeper into this dynamic, we introduce cuDialog, a first-of-its-kind benchmark for dialogue generation with a cultural lens. We also develop baseline models capable of extracting cultural attributes from dialogue exchanges, with the goal of enhancing the predictive accuracy and quality of dialogue agents. To effectively co-learn cultural understanding and multi-turn dialogue predictions, we propose to incorporate cultural dimensions with dialogue encoding features. Our experimental findings highlight that incorporating cultural value surveys boosts alignment with references and cultural markers, demonstrating its considerable influence on personalization and dialogue quality. To facilitate further exploration in this exciting domain, we make our benchmark publicly accessible at https://github.com/yongcaoplus/cuDialog.

UniMEEC: Towards Unified Multimodal Emotion Recognition and Emotion Cause
Guimin Hu | Zhihong Zhu | Daniel Hershcovich | Lijie Hu | Hasti Seifi | Jiayuan Xie
Findings of the Association for Computational Linguistics: EMNLP 2024

Multimodal emotion recognition in conversation (MERC) and multimodal emotion-cause pair extraction (MECPE) have recently garnered significant attention. Emotions are expressions of affect or feelings, arising in response to specific events or situations known as emotion causes. Together, the two explain the causality between human emotion and intents. However, existing works treat emotion recognition and emotion cause extraction as two individual problems, ignoring their natural causality. In this paper, we propose a Unified Multimodal Emotion recognition and Emotion-Cause analysis framework (UniMEEC) to explore the causality between emotion and emotion cause. Concretely, UniMEEC reformulates the MERC and MECPE tasks as mask prediction problems and unifies them with a causal prompt template. To differentiate the modal effects, UniMEEC proposes a multimodal causal prompt to probe modality-specific pre-trained knowledge and implements cross-task and cross-modality interactions under task-oriented settings. Experimental results on four public benchmark datasets verify the model performance on MERC and MECPE tasks and show consistent improvements over the previous state-of-the-art methods.

Vision-Language Models under Cultural and Inclusive Considerations
Antonia Karamolegkou | Phillip Rust | Ruixiang Cui | Yong Cao | Anders Søgaard | Daniel Hershcovich
Proceedings of the 1st Human-Centered Large Language Modeling Workshop

Large Vision Language Models can be used to assist visually impaired individuals by describing images they capture in their daily lives. Current evaluation datasets may not reflect the diverse cultural user backgrounds nor the situational context of this use case. To address this problem, we create a survey to determine caption preferences and propose a culture-centric evaluation benchmark by filtering VizWiz, an existing dataset with images taken by people who are blind. We then evaluate different models and prompts, investigating their reliability as visual assistants. While the evaluation results for state-of-the-art models seem promising, we identified some weak spots such as hallucinations and problems with conventional evaluation metrics. Our survey, data, code, and model outputs will be publicly available.

Can Abstract Meaning Representation Facilitate Fair Legal Judgement Predictions?
Supriti Vijay | Daniel Hershcovich
Proceedings of the Fifth Workshop on Insights from Negative Results in NLP

Legal judgment prediction encompasses the automated prediction of case outcomes by leveraging historical facts and opinions. While this approach holds the potential to enhance the efficiency of the legal system, it also raises critical concerns regarding the perpetuation of biases. Abstract Meaning Representation has shown promise as an intermediate text representation in various downstream NLP tasks due to its ability to capture semantically meaningful information in a graph-like structure. In this paper, we employ this ability of AMR in the legal judgement prediction task and assess to what extent it encodes biases, or conversely, abstracts away from them. Our study reveals that while AMR-based models exhibit worse overall performance than transformer-based models, they are less biased for attributes like age and defendant state compared to gender. By shedding light on these findings, this paper contributes to a more nuanced understanding of AMR’s potential benefits and limitations in legal NLP.

Development and Evaluation of Pre-trained Language Models for Historical Danish and Norwegian Literary Texts
Ali Al-Laith | Alexander Conroy | Jens Bjerring-Hansen | Daniel Hershcovich
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We develop and evaluate the first pre-trained language models specifically tailored for historical Danish and Norwegian texts. Three models are trained on a corpus of 19th-century Danish and Norwegian literature: two directly on the corpus with no prior pre-training, and one with continued pre-training. To evaluate the models, we utilize an existing sentiment classification dataset, and additionally introduce a new annotated word sense disambiguation dataset focusing on the concept of fate. Our assessment reveals that the model employing continued pre-training outperforms the others in two downstream NLP tasks on historical texts. Specifically, we observe substantial improvement in sentiment classification and word sense disambiguation compared to models trained on contemporary texts. These results highlight the effectiveness of continued pre-training for enhancing performance across various NLP tasks in historical text analysis.

Cultural Adaptation of Recipes
Yong Cao | Yova Kementchedjhieva | Ruixiang Cui | Antonia Karamolegkou | Li Zhou | Megan Dare | Lucia Donatelli | Daniel Hershcovich
Transactions of the Association for Computational Linguistics, Volume 12

Building upon the considerable advances in Large Language Models (LLMs), we are now equipped to address more sophisticated tasks demanding a nuanced understanding of cross-cultural contexts. A key example is recipe adaptation, which goes beyond simple translation to include a grasp of ingredients, culinary techniques, and dietary preferences specific to a given culture. We introduce a new task involving the translation and cultural adaptation of recipes between Chinese- and English-speaking cuisines. To support this investigation, we present CulturalRecipes, a unique dataset composed of automatically paired recipes written in Mandarin Chinese and English. This dataset is further enriched with a human-written and curated test set. In this intricate task of cross-cultural recipe adaptation, we evaluate the performance of various methods, including GPT-4 and other LLMs, traditional machine translation, and information retrieval techniques. Our comprehensive analysis includes both automatic and human evaluation metrics. While GPT-4 exhibits impressive abilities in adapting Chinese recipes into English, it still lags behind human expertise when translating English recipes into Chinese. This underscores the multifaceted nature of cultural adaptations. We anticipate that these insights will significantly contribute to future research on culturally aware language models and their practical application in culturally diverse contexts.

Creoles represent an under-explored and marginalized group of languages, with few available resources for NLP research. While the genealogical ties between Creoles and a number of highly resourced languages imply a significant potential for transfer learning, this potential is hampered due to this lack of annotated data. In this work we present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks, covering up to 28 Creole languages; it is an aggregate of novel development datasets for reading comprehension, relation classification, and machine translation for Creoles, in addition to a practical gateway to a handful of preexisting benchmarks. For each benchmark, we conduct baseline experiments in a zero-shot setting in order to further ascertain the capabilities and limitations of transfer learning for Creoles. Ultimately, we see CreoleVal as an opportunity to empower research on Creoles in NLP and computational linguistics, and in general, a step towards more equitable language technology around the globe.

2023

On Evaluating Multilingual Compositional Generalization with Translated Datasets
Zi Wang | Daniel Hershcovich
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Compositional generalization allows efficient learning and human-like inductive biases. Since most research investigating compositional generalization in NLP is done on English, important questions remain underexplored. Do the necessary compositional generalization abilities differ across languages? Can models compositionally generalize cross-lingually? As a first step to answering these questions, recent work used neural machine translation to translate datasets for evaluating compositional generalization in semantic parsing. However, we show that this entails critical semantic distortion. To address this limitation, we craft a faithful rule-based translation of the MCWQ dataset from English to Chinese and Japanese. Even with the resulting robust benchmark, which we call MCWQ-R, we show that the distribution of compositions still suffers due to linguistic divergences, and that multilingual models still struggle with cross-lingual compositional generalization. Our dataset and methodology will serve as useful resources for the study of cross-lingual compositional generalization in other tasks.

What does the Failure to Reason with “Respectively” in Zero/Few-Shot Settings Tell Us about Language Models?
Ruixiang Cui | Seolhwa Lee | Daniel Hershcovich | Anders Søgaard
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Humans can effortlessly understand the coordinate structure of sentences such as “Niels Bohr and Kurt Cobain were born in Copenhagen and Seattle, *respectively*”. In the context of natural language inference (NLI), we examine how language models (LMs) reason with respective readings (Gawron and Kehler, 2004) from two perspectives: syntactic-semantic and commonsense-world knowledge. We propose a controlled synthetic dataset WikiResNLI and a naturally occurring dataset NatResNLI to encompass various explicit and implicit realizations of “respectively”. We show that fine-tuned NLI models struggle with understanding such readings without explicit supervision. While few-shot learning is easy in the presence of explicit cues, longer training is required when the reading is evoked implicitly, leaving models to rely on common sense inferences. Furthermore, our fine-grained analysis indicates models fail to generalize across different constructions. To conclude, we demonstrate that LMs still lag behind humans in generalizing to the long tail of linguistic constructions.

What’s the Meaning of Superhuman Performance in Today’s NLU?
Simone Tedeschi | Johan Bos | Thierry Declerck | Jan Hajič | Daniel Hershcovich | Eduard Hovy | Alexander Koller | Simon Krek | Steven Schockaert | Rico Sennrich | Ekaterina Shutova | Roberto Navigli
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In the last five years, there has been a significant focus in Natural Language Processing (NLP) on developing larger Pretrained Language Models (PLMs) and introducing benchmarks such as SuperGLUE and SQuAD to measure their abilities in language understanding, reasoning, and reading comprehension. These PLMs have achieved impressive results on these benchmarks, even surpassing human performance in some cases. This has led to claims of superhuman capabilities and the provocative idea that certain tasks have been solved. In this position paper, we take a critical look at these claims and ask whether PLMs truly have superhuman abilities and what the current benchmarks are really evaluating. We show that these benchmarks have serious limitations affecting the comparison between humans and PLMs and provide recommendations for fairer and more transparent benchmarks.

Probing for Hyperbole in Pre-Trained Language Models
Nina Schneidermann | Daniel Hershcovich | Bolette Pedersen
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Hyperbole is a common figure of speech, which is under-explored in NLP research. In this study, we conduct edge and minimal description length (MDL) probing experiments on three pre-trained language models (PLMs) in an attempt to explore the extent to which hyperbolic information is encoded in these models. We use both word-in-context and sentence-level representations as model inputs as a basis for comparison. We also annotate 63 hyperbole sentences from the HYPO dataset according to an operational taxonomy to conduct an error analysis to explore the encoding of different hyperbole categories. Our results show that hyperbole is to a limited extent encoded in PLMs, and mostly in the final layers. They also indicate that hyperbolic information may be better encoded by the sentence-level representations, which, due to the pragmatic nature of hyperbole, may therefore provide a more accurate and informative representation in PLMs. Finally, the inter-annotator agreement for our annotations, a Cohen’s Kappa of 0.339, suggests that the taxonomy categories may not be intuitive and need revision or simplification.

Cross-Cultural Transfer Learning for Chinese Offensive Language Detection
Li Zhou | Laura Cabello | Yong Cao | Daniel Hershcovich
Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP)

Detecting offensive language is a challenging task. Generalizing across different cultures and languages becomes even more challenging: besides lexical, syntactic and semantic differences, pragmatic aspects such as cultural norms and sensitivities, which are particularly relevant in this context, vary greatly. In this paper, we target Chinese offensive language detection and aim to investigate the impact of transfer learning using offensive language detection data from different cultural backgrounds, specifically Korean and English. We find that culture-specific biases in what is considered offensive negatively impact the transferability of language models (LMs) and that LMs trained on diverse cultural data are sensitive to different features in Chinese offensive language detection. In a few-shot learning scenario, however, our study shows promising prospects for non-English offensive language detection with limited resources. Our findings highlight the importance of cross-cultural transfer learning in improving offensive language detection and promoting inclusive digital spaces.

Assessing Cross-Cultural Alignment between ChatGPT and Human Societies: An Empirical Study
Yong Cao | Li Zhou | Seolhwa Lee | Laura Cabello | Min Chen | Daniel Hershcovich
Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP)

The recent release of ChatGPT has garnered widespread recognition for its exceptional ability to generate human-like conversations. Given its usage by users from various nations and its training on a vast multilingual corpus that includes diverse cultural and societal norms, it is crucial to evaluate its effectiveness in cultural adaptation. In this paper, we investigate the underlying cultural background of ChatGPT by analyzing its responses to questions designed to quantify human cultural differences. Our findings suggest that, when prompted with American context, ChatGPT exhibits a strong alignment with American culture, but it adapts less effectively to other cultural contexts. Furthermore, by using different prompts to probe the model, we show that English prompts reduce the variance in model responses, flattening out cultural differences and biasing them towards American culture. This study provides valuable insights into the cultural implications of ChatGPT and highlights the necessity of greater diversity and cultural awareness in language technologies.

A Two-Sided Discussion of Preregistration of NLP Research
Anders Søgaard | Daniel Hershcovich | Miryam de Lhoneux
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Van Miltenburg et al. (2021) suggest NLP research should adopt preregistration to prevent fishing expeditions and to promote publication of negative results. At face value, this is a very reasonable suggestion, seemingly solving many methodological problems with NLP research. We discuss pros and cons - some old, some new: a) Preregistration is challenged by the practice of retrieving hypotheses after the results are known; b) preregistration may bias NLP toward confirmatory research; c) preregistration must allow for reclassification of research as exploratory; d) preregistration may increase publication bias; e) preregistration may increase flag-planting; f) preregistration may increase p-hacking; and finally, g) preregistration may make us less risk tolerant. We cast our discussion as a dialogue, presenting both sides of the debate.

Pay More Attention to Relation Exploration for Knowledge Base Question Answering
Yong Cao | Xianzhi Li | Huiwen Liu | Wen Dai | Shuai Chen | Bin Wang | Min Chen | Daniel Hershcovich
Findings of the Association for Computational Linguistics: ACL 2023

Knowledge base question answering (KBQA) is a challenging task that aims to retrieve correct answers from large-scale knowledge bases. Existing attempts primarily focus on entity representation and final answer reasoning, which results in limited supervision for this task. Moreover, the relations, which empirically determine the reasoning path selection, are not fully considered in recent advancements. In this study, we propose a novel framework, RE-KBQA, that utilizes relations in the knowledge base to enhance entity representation and introduce additional supervision. We explore guidance from relations in three aspects, including (1) distinguishing similar entities by employing a variational graph auto-encoder to learn relation importance; (2) exploring extra supervision by predicting relation distributions as soft labels with a multi-task scheme; (3) designing a relation-guided re-ranking algorithm for post-processing. Experimental results on two benchmark datasets demonstrate the effectiveness and superiority of our framework, improving the F1 score by 5.8% from 40.5 to 46.3 on CWQ and 5.7% from 62.8 to 68.5 on WebQSP, better or on par with state-of-the-art methods.

Cultural Compass: Predicting Transfer Learning Success in Offensive Language Detection with Cultural Features
Li Zhou | Antonia Karamolegkou | Wenyu Chen | Daniel Hershcovich
Findings of the Association for Computational Linguistics: EMNLP 2023

The increasing ubiquity of language technology necessitates a shift towards considering cultural diversity in the machine learning realm, particularly for subjective tasks that rely heavily on cultural nuances, such as Offensive Language Detection (OLD). Current understanding underscores that these tasks are substantially influenced by cultural values; however, a notable gap exists in determining if cultural features can accurately predict the success of cross-cultural transfer learning for such subjective tasks. Addressing this, our study delves into the intersection of cultural features and transfer learning effectiveness. The findings reveal that cultural value surveys indeed possess a predictive power for cross-cultural transfer learning success in OLD tasks, and that it can be further improved using offensive word distance. Based on these results, we advocate for the integration of cultural information into datasets. Additionally, we recommend leveraging data sources rich in cultural information, such as surveys, to enhance cultural adaptability. Our research signifies a step forward in the quest for more inclusive, culturally sensitive language technologies.

Sentiment Classification of Historical Danish and Norwegian Literary Texts
Ali Al-Laith | Kirstine Nielsen Degn | Alexander Conroy | Bolette Sandford Pedersen | Jens Bjerring-Hansen | Daniel Hershcovich
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Sentiment classification is valuable for literary analysis, as sentiment is crucial in literary narratives. It can, for example, be used to investigate a hypothesis in the literary analysis of 19th-century Scandinavian novels that the writing of female authors in this period was characterized by negative sentiment, as this paper shows. In order to enable a data-driven analysis of this hypothesis, we create a manually annotated dataset of sentence-level sentiment annotations for novels from this period and use it to train and evaluate various sentiment classification methods. We find that pre-trained multilingual language models outperform models trained on modern Danish, as well as classifiers based on lexical resources. Finally, in classifier-assisted corpus analysis, we confirm the literary hypothesis regarding the author’s gender and further shed light on the temporal development of the trend. Our dataset and trained models will be useful for future analysis of historical Danish and Norwegian literary texts.

2022

Various efforts in the Natural Language Processing (NLP) community have been made to accommodate linguistic diversity and serve speakers of many different languages. However, it is important to acknowledge that speakers, and the content they produce and require, vary not just by language, but also by culture. Although language and culture are tightly linked, there are important differences. Analogous to cross-lingual and multilingual NLP, cross-cultural and multicultural NLP considers these differences in order to better serve users of NLP systems. We propose a principled framework to frame these efforts, and survey existing and potential strategies.

Generalized Quantifiers as a Source of Error in Multilingual NLU Benchmarks
Ruixiang Cui | Daniel Hershcovich | Anders Søgaard
Proceedings of the First Workshop on Dynamic Adversarial Data Collection

Logical approaches to representing language have developed and evaluated computational models of quantifier words since the 19th century, but today’s NLU models still struggle to capture their semantics. We rely on Generalized Quantifier Theory for language-independent representations of the semantics of quantifier words, to quantify their contribution to the errors of NLU models. We find that quantifiers are pervasive in NLU benchmarks, and their occurrence at test time is associated with performance drops. Multilingual models also exhibit unsatisfying quantifier reasoning abilities, but not necessarily worse for non-English languages. To facilitate directly-targeted probing, we present an adversarial generalized quantifier NLI task (GQNLI) and show that pre-trained language models have a clear lack of robustness in generalized quantifier reasoning.

Towards Climate Awareness in NLP Research
Daniel Hershcovich | Nicolas Webersinke | Mathias Kraus | Julia Bingler | Markus Leippold
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

The climate impact of AI, and NLP research in particular, has become a serious issue given the enormous amount of energy that is increasingly being used for training and running computational models. Consequently, increasing focus is placed on efficient NLP. However, this important initiative lacks simple guidelines that would allow for systematic climate reporting of NLP research. We argue that this deficiency is one of the reasons why very few publications in NLP report key figures that would allow a more thorough examination of environmental impact, and present a quantitative survey to demonstrate this. As a remedy, we propose a climate performance model card with the primary purpose of being practically usable with only limited information about experiments and the underlying computer hardware. We describe why this step is essential to increase awareness about the environmental impact of NLP research and thereby pave the way for more thorough discussions.

Can AMR Assist Legal and Logical Reasoning?
Nikolaus Schrack | Ruixiang Cui | Hugo López | Daniel Hershcovich
Findings of the Association for Computational Linguistics: EMNLP 2022

Abstract Meaning Representation (AMR) has been shown to be useful for many downstream tasks. In this work, we explore the use of AMR for legal and logical reasoning. Specifically, we investigate if AMR can help capture logical relationships on multiple choice question answering (MCQA) tasks. We propose neural architectures that utilize linearised AMR graphs in combination with pre-trained language models. While these models are not able to outperform text-only baselines, they correctly solve different instances than the text models, suggesting complementary abilities. Error analysis further reveals that AMR parsing quality is the most prominent challenge, especially regarding inputs with multiple sentences. We conduct a theoretical analysis of how logical relations are represented in AMR and conclude it might be helpful in some logical statements but not for others.

Generalized Quantifiers as a Source of Error in Multilingual NLU Benchmarks
Ruixiang Cui | Daniel Hershcovich | Anders Søgaard
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Logical approaches to representing language have developed and evaluated computational models of quantifier words since the 19th century, but today’s NLU models still struggle to capture their semantics. We rely on Generalized Quantifier Theory for language-independent representations of the semantics of quantifier words, to quantify their contribution to the errors of NLU models. We find that quantifiers are pervasive in NLU benchmarks, and their occurrence at test time is associated with performance drops. Multilingual models also exhibit unsatisfying quantifier reasoning abilities, but not necessarily worse for non-English languages. To facilitate directly-targeted probing, we present an adversarial generalized quantifier NLI task (GQNLI) and show that pre-trained language models have a clear lack of robustness in generalized quantifier reasoning.

A Dataset of Sustainable Diet Arguments on Twitter
Marcus Hansen | Daniel Hershcovich
Proceedings of the Second Workshop on NLP for Positive Impact (NLP4PI)

Sustainable development requires a significant change in our dietary habits. Argument mining can help achieve this goal by both affecting and helping understand people’s behavior. We design an annotation scheme for argument mining from online discourse around sustainable diets, including novel evidence types specific to this domain. Using Twitter as a source, we crowdsource a dataset of 597 tweets annotated in relation to 5 topics. We benchmark a variety of NLP models on this dataset, demonstrating strong performance in some sub-tasks, while highlighting remaining challenges.

Compositional Generalization in Multilingual Semantic Parsing over Wikidata
Ruixiang Cui | Rahul Aralikatte | Heather Lent | Daniel Hershcovich
Transactions of the Association for Computational Linguistics, Volume 10

Semantic parsing (SP) allows humans to leverage vast knowledge resources through natural interaction. However, parsers are mostly designed for and evaluated on English resources, such as CFQ (Keysers et al., 2020), the current standard benchmark based on English data generated from grammar rules and oriented towards Freebase, an outdated knowledge base. We propose a method for creating a multilingual, parallel dataset of question-query pairs, grounded in Wikidata. We introduce such a dataset, which we call Multilingual Compositional Wikidata Questions (MCWQ), and use it to analyze the compositional generalization of semantic parsers in Hebrew, Kannada, Chinese, and English. While within-language generalization is comparable across languages, experiments on zero-shot cross-lingual transfer demonstrate that cross-lingual compositional generalization fails, even with state-of-the-art pretrained multilingual encoders. Furthermore, our methodology, dataset, and results will facilitate future research on SP in more realistic and diverse settings than has been possible with existing resources.

2021

Moses and the Character-Based Random Babbling Baseline: CoAStaL at AmericasNLP 2021 Shared Task
Marcel Bollmann | Rahul Aralikatte | Héctor Murrieta Bello | Daniel Hershcovich | Miryam de Lhoneux | Anders Søgaard
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

We evaluated a range of neural machine translation techniques developed specifically for low-resource scenarios. Unsuccessfully. In the end, we submitted two runs: (i) a standard phrase-based model, and (ii) a random babbling baseline using character trigrams. We found that it was surprisingly hard to beat (i), in spite of this model being, in theory, a bad fit for polysynthetic languages; and more interestingly, that (ii) was better than several of the submitted systems, highlighting how difficult low-resource machine translation for polysynthetic languages is.

Can Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color
Mostafa Abdou | Artur Kulmizev | Daniel Hershcovich | Stella Frank | Ellie Pavlick | Anders Søgaard
Proceedings of the 25th Conference on Computational Natural Language Learning

Pretrained language models have been shown to encode relational information, such as the relations between entities or concepts in knowledge-bases: (Paris, Capital, France). However, simple relations of this type can often be recovered heuristically and the extent to which models implicitly reflect topological structure that is grounded in the world, such as perceptual structure, is unknown. To explore this question, we conduct a thorough case study on color. Namely, we employ a dataset of monolexemic color terms and color chips represented in CIELAB, a color space with a perceptually meaningful distance metric. Using two methods of evaluating the structural alignment of colors in this space with text-derived color term representations, we find significant correspondence. Analyzing the differences in alignment across the color spectrum, we find that warmer colors are, on average, better aligned to the perceptual color space than cooler ones, suggesting an intriguing connection to findings from recent work on efficient communication in color naming. Further analysis suggests that differences in alignment are, in part, mediated by collocationality and differences in syntactic usage, posing questions as to the relationship between color perception and usage and context.

A Multilingual Benchmark for Probing Negation-Awareness with Minimal Pairs
Mareike Hartmann | Miryam de Lhoneux | Daniel Hershcovich | Yova Kementchedjhieva | Lukas Nielsen | Chen Qiu | Anders Søgaard
Proceedings of the 25th Conference on Computational Natural Language Learning

Negation is one of the most fundamental concepts in human cognition and language, and several natural language inference (NLI) probes have been designed to investigate pretrained language models’ ability to detect and reason with negation. However, the existing probing datasets are limited to English only, and do not enable controlled probing of performance in the absence or presence of negation. In response, we present a multilingual (English, Bulgarian, German, French and Chinese) benchmark collection of NLI examples that are grammatical and correctly labeled, as a result of manual inspection and reformulation. We use the benchmark to probe the negation-awareness of multilingual language models and find that models that correctly predict examples with negation cues often fail to correctly predict their counter-examples without negation cues, even when the cues are irrelevant for semantic inference.

Great Service! Fine-grained Parsing of Implicit Arguments
Ruixiang Cui | Daniel Hershcovich
Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021)

Broad-coverage meaning representations in NLP mostly focus on explicitly expressed content. More importantly, the scarcity of datasets annotating diverse implicit roles limits empirical studies into their linguistic nuances. For example, in the web review “Great service!”, the provider and consumer are implicit arguments of different types. We examine an annotated corpus of fine-grained implicit arguments (Cui and Hershcovich, 2020) by carefully re-annotating it, resolving several inconsistencies. Subsequently, we present the first transition-based neural parser that can handle implicit arguments dynamically, and experiment with two different transition systems on the improved dataset. We find that certain types of implicit arguments are more difficult to parse than others and that the simpler system is more accurate in recovering implicit arguments, despite having a lower overall parsing score, attesting to current reasoning limitations of NLP models. This work will facilitate a better understanding of implicit and underspecified language, by incorporating it holistically into meaning representations.

Lexical Semantic Recognition
Nelson F. Liu | Daniel Hershcovich | Michael Kranzlein | Nathan Schneider
Proceedings of the 17th Workshop on Multiword Expressions (MWE 2021)

In lexical semantics, full-sentence segmentation and segment labeling of various phenomena are generally treated separately, despite their interdependence. We hypothesize that a unified lexical semantic recognition task is an effective way to encapsulate previously disparate styles of annotation, including multiword expression identification / classification and supersense tagging. Using the STREUSLE corpus, we train a neural CRF sequence tagger and evaluate its performance along various axes of annotation. As the label set generalizes that of previous tasks (PARSEME, DiMSUM), we additionally evaluate how well the model generalizes to those test sets, finding that it approaches or surpasses existing models despite training only on STREUSLE. Our work also establishes baseline models and evaluation metrics for integrated and accurate modeling of lexical semantics, facilitating future work in this area.

How far can we get with one GPU in 100 hours? CoAStaL at MultiIndicMT Shared Task
Rahul Aralikatte | Héctor Ricardo Murrieta Bello | Miryam de Lhoneux | Daniel Hershcovich | Marcel Bollmann | Anders Søgaard
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

This work shows that competitive translation results can be obtained in a constrained setting by incorporating the latest advances in memory and compute optimization. We train and evaluate large multilingual translation models using a single GPU for a maximum of 100 hours and get within 4-5 BLEU points of the top submission on the leaderboard. We also benchmark standard baselines on the PMI corpus and re-discover well-known shortcomings of translation systems and metrics.

2020

Comparison by Conversion: Reverse-Engineering UCCA from Syntax and Lexical Semantics
Daniel Hershcovich | Nathan Schneider | Dotan Dvir | Jakob Prange | Miryam de Lhoneux | Omri Abend
Proceedings of the 28th International Conference on Computational Linguistics

Building robust natural language understanding systems will require a clear characterization of whether and how various linguistic meaning representations complement each other. To perform a systematic comparative analysis, we evaluate the mapping between meaning representations from different frameworks using two complementary methods: (i) a rule-based converter, and (ii) a supervised delexicalized parser that parses to one framework using only information from the other as features. We apply these methods to convert the STREUSLE corpus (with syntactic and lexical semantic annotations) to UCCA (a graph-structured full-sentence meaning representation). Both methods yield surprisingly accurate target representations, close to fully supervised UCCA parser quality—indicating that UCCA annotations are partially redundant with STREUSLE annotations. Despite this substantial convergence between frameworks, we find several important areas of divergence.

Cross-lingual Semantic Representation for NLP with UCCA
Omri Abend | Dotan Dvir | Daniel Hershcovich | Jakob Prange | Nathan Schneider
Proceedings of the 28th International Conference on Computational Linguistics: Tutorial Abstracts

This is an introductory tutorial to UCCA (Universal Conceptual Cognitive Annotation), a cross-linguistically applicable framework for semantic representation, with corpora annotated in English, German and French, and ongoing annotation in Russian and Hebrew. UCCA builds on extensive typological work and supports rapid annotation. The tutorial will provide a detailed introduction to the UCCA annotation guidelines, design philosophy and the available resources; and a comparison to other meaning representations. It will also survey the existing parsing work, including the findings of three recent shared tasks, in SemEval and CoNLL, that addressed UCCA parsing. Finally, the tutorial will present recent applications and extensions to the scheme, demonstrating its value for natural language processing in a range of languages and domains.

Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing
Stephan Oepen | Omri Abend | Lasha Abzianidze | Johan Bos | Jan Hajič | Daniel Hershcovich | Bin Li | Tim O’Gorman | Nianwen Xue | Daniel Zeman
Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing

MRP 2020: The Second Shared Task on Cross-Framework and Cross-Lingual Meaning Representation Parsing
Stephan Oepen | Omri Abend | Lasha Abzianidze | Johan Bos | Jan Hajič | Daniel Hershcovich | Bin Li | Tim O’Gorman | Nianwen Xue | Daniel Zeman
Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing

The 2020 Shared Task at the Conference for Computational Language Learning (CoNLL) was devoted to Meaning Representation Parsing (MRP) across frameworks and languages. Extending a similar setup from the previous year, five distinct approaches to the representation of sentence meaning in the form of directed graphs were represented in the English training and evaluation data for the task, packaged in a uniform graph abstraction and serialization; for four of these representation frameworks, additional training and evaluation data was provided for one additional language per framework. The task received submissions from eight teams, of which two do not participate in the official ranking because they arrived after the closing deadline or made use of additional training data. All technical information regarding the task, including system submissions, official results, and links to supporting resources and software, is available from the task web site at: http://mrp.nlpl.eu

HUJI-KU at MRP 2020: Two Transition-based Neural Parsers
Ofir Arviv | Ruixiang Cui | Daniel Hershcovich
Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing

This paper describes the HUJI-KU system submission to the shared task on Cross-Framework Meaning Representation Parsing (MRP) at the 2020 Conference for Computational Language Learning (CoNLL), employing TUPA and the HIT-SCIR parser, which were, respectively, the baseline system and winning system in the 2019 MRP shared task. Both are transition-based parsers using BERT contextualized embeddings. We generalized TUPA to support the newly-added MRP frameworks and languages, and experimented with multitask learning with the HIT-SCIR parser. We reached 4th place in both the cross-framework and cross-lingual tracks.

Refining Implicit Argument Annotation for UCCA
Ruixiang Cui | Daniel Hershcovich
Proceedings of the Second International Workshop on Designing Meaning Representations

Predicate-argument structure analysis is a central component in meaning representations of text. The fact that some arguments are not explicitly mentioned in a sentence gives rise to ambiguity in language understanding, and renders it difficult for machines to interpret text correctly. However, only a few resources represent implicit roles for NLU, and existing studies in NLP only make coarse distinctions between categories of arguments omitted from linguistic form. This paper proposes a typology for fine-grained implicit argument annotation on top of Universal Conceptual Cognitive Annotation’s foundational layer. The proposed implicit argument categorisation is driven by theories of implicit role interpretation and consists of six types: Deictic, Generic, Genre-based, Type-identifiable, Non-specific, and Iterated-set. We exemplify our design by revisiting part of the UCCA EWT corpus, providing a new dataset annotated with the refinement layer, and making a comparative analysis with other schemes.

Køpsala: Transition-Based Graph Parsing via Efficient Training and Effective Encoding
Daniel Hershcovich | Miryam de Lhoneux | Artur Kulmizev | Elham Pejhan | Joakim Nivre
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

We present Køpsala, the Copenhagen-Uppsala system for the Enhanced Universal Dependencies Shared Task at IWPT 2020. Our system is a pipeline consisting of off-the-shelf models for everything but enhanced graph parsing, and for the latter, a transition-based graph parser adapted from Che et al. (2019). We train a single enhanced parser model per language, using gold sentence splitting and tokenization for training, and rely only on tokenized surface forms and multilingual BERT for encoding. While a bug introduced just before submission resulted in a severe drop in precision, its post-submission fix would bring us to 4th place in the official ranking, according to average ELAS. Our parser demonstrates that a unified pipeline is effective for both Meaning Representation Parsing and Enhanced Universal Dependencies.

2019

Rewarding Coreference Resolvers for Being Consistent with World Knowledge
Rahul Aralikatte | Heather Lent | Ana Valeria Gonzalez | Daniel Hershcovich | Chen Qiu | Anders Sandholm | Michael Ringaard | Anders Søgaard
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Unresolved coreference is a bottleneck for relation extraction, and high-quality coreference resolvers may produce an output that makes it a lot easier to extract knowledge triples. We show how to improve coreference resolvers by forwarding their input to a relation extraction system and rewarding the resolvers for producing triples that are found in knowledge bases. Since relation extraction systems can rely on different forms of supervision and be biased in different ways, we obtain the best performance, improving over the state of the art, using multi-task reinforcement learning.

Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning
Stephan Oepen | Omri Abend | Jan Hajic | Daniel Hershcovich | Marco Kuhlmann | Tim O’Gorman | Nianwen Xue
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning

MRP 2019: Cross-Framework Meaning Representation Parsing
Stephan Oepen | Omri Abend | Jan Hajic | Daniel Hershcovich | Marco Kuhlmann | Tim O’Gorman | Nianwen Xue | Jayeol Chun | Milan Straka | Zdenka Uresova
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning

The 2019 Shared Task at the Conference for Computational Language Learning (CoNLL) was devoted to Meaning Representation Parsing (MRP) across frameworks. Five distinct approaches to the representation of sentence meaning in the form of directed graphs were represented in the training and evaluation data for the task, packaged in a uniform abstract graph representation and serialization. The task received submissions from eighteen teams, of which five do not participate in the official ranking because they arrived after the closing deadline, made use of additional training data, or involved one of the task co-organizers. All technical information regarding the task, including system submissions, official results, and links to supporting resources and software, is available from the task web site at: http://mrp.nlpl.eu

TUPA at MRP 2019: A Multi-Task Baseline System
Daniel Hershcovich | Ofir Arviv
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning

This paper describes the TUPA system submission to the shared task on Cross-Framework Meaning Representation Parsing (MRP) at the 2019 Conference for Computational Language Learning (CoNLL). Because it was prepared by one of the task co-organizers, TUPA provides a baseline point of comparison and is not considered in the official ranking of participating systems. While originally developed for UCCA only, TUPA has been generalized to support all MRP frameworks included in the task, and trained using multi-task learning to parse them all with a shared model. It is a transition-based parser with a BiLSTM encoder, augmented with BERT contextualized embeddings.

Content Differences in Syntactic and Semantic Representation
Daniel Hershcovich | Omri Abend | Ari Rappoport
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Syntactic analysis plays an important role in semantic parsing, but the nature of this role remains a topic of ongoing debate. The debate has been constrained by the scarcity of empirical comparative studies between syntactic and semantic schemes, which hinders the development of parsing methods informed by the details of target schemes and constructions. We target this gap, and take Universal Dependencies (UD) and UCCA as a test case. After abstracting away from differences of convention or formalism, we find that most content divergences can be ascribed to: (1) UCCA’s distinction between a Scene and a non-Scene; (2) UCCA’s distinction between primary relations, secondary ones and participants; (3) different treatment of multi-word expressions, and (4) different treatment of inter-clause linkage. We further discuss the long tail of cases where the two schemes take markedly different approaches. Finally, we show that the proposed comparison methodology can be used for fine-grained evaluation of UCCA parsing, highlighting both challenges and potential sources for improvement. The substantial differences between the schemes suggest that semantic parsers are likely to benefit downstream text understanding applications beyond their syntactic counterparts.

Argument Invention from First Principles
Yonatan Bilu | Ariel Gera | Daniel Hershcovich | Benjamin Sznajder | Dan Lahav | Guy Moshkowich | Anael Malet | Assaf Gavron | Noam Slonim
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Competitive debaters often find themselves facing a challenging task – how to debate a topic they know very little about, with only minutes to prepare, and without access to books or the Internet? What they often do is rely on “first principles”, commonplace arguments which are relevant to many topics, and which they have refined in past debates. In this work we aim to explicitly define a taxonomy of such principled recurring arguments, and, given a controversial topic, to automatically identify which of these arguments are relevant to the topic. As far as we know, this is the first time that this approach to argument invention is formalized and made explicit in the context of NLP. The main goal of this work is to show that it is possible to define such a taxonomy. While the taxonomy suggested here should be thought of as a “first attempt”, it is nonetheless coherent, covers the relevant topics well, coincides with what professional debaters actually argue in their speeches, and facilitates automatic argument invention for new topics.

The Language of Legal and Illegal Activity on the Darknet
Leshem Choshen | Dan Eldad | Daniel Hershcovich | Elior Sulem | Omri Abend
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The non-indexed parts of the Internet (the Darknet) have become a haven for both legal and illegal anonymous activity. Given the magnitude of these networks, scalably monitoring their activity necessarily relies on automated tools, and notably on NLP tools. However, little is known about what characteristics texts communicated through the Darknet have, and how well off-the-shelf NLP tools perform on this domain. This paper tackles this gap and performs an in-depth investigation of the characteristics of legal and illegal text in the Darknet, comparing it to a clear net website with similar content as a control condition. Taking drugs-related websites as a test case, we find that texts for selling legal and illegal drugs have several linguistic characteristics that distinguish them from one another, as well as from the control condition, among them the distribution of POS tags, and the coverage of their named entities in Wikipedia.

SemEval-2019 Task 1: Cross-lingual Semantic Parsing with UCCA
Daniel Hershcovich | Zohar Aizenbud | Leshem Choshen | Elior Sulem | Ari Rappoport | Omri Abend
Proceedings of the 13th International Workshop on Semantic Evaluation

We present the SemEval 2019 shared task on Universal Conceptual Cognitive Annotation (UCCA) parsing in English, German and French, and discuss the participating systems and results. UCCA is a cross-linguistically applicable framework for semantic representation, which builds on extensive typological work and supports rapid annotation. UCCA poses a challenge for existing parsing techniques, as it exhibits reentrancy (resulting in DAG structures), discontinuous structures and non-terminal nodes corresponding to complex semantic units. The shared task has yielded improvements over the state-of-the-art baseline in all languages and settings. Full results can be found on the task’s website: https://competitions.codalab.org/competitions/19160.

Syntactic Interchangeability in Word Embedding Models
Daniel Hershcovich | Assaf Toledo | Alon Halfon | Noam Slonim
Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP

Nearest neighbors in word embedding models are commonly observed to be semantically similar, but the relations between them can vary greatly. We investigate the extent to which word embedding models preserve syntactic interchangeability, as reflected by distances between word vectors, and the effect of hyper-parameters—context window size in particular. We use part of speech (POS) as a proxy for syntactic interchangeability, as generally speaking, words with the same POS are syntactically valid in the same contexts. We also investigate the relationship between interchangeability and similarity as judged by commonly used word similarity benchmarks, and correlate the result with the performance of word embedding models on these benchmarks. Our results will inform future research and applications in the selection of a word embedding model, suggesting a principle for an appropriate selection of the context window size parameter depending on the use-case.

2018

Universal Dependency Parsing with a General Transition-Based DAG Parser
Daniel Hershcovich | Omri Abend | Ari Rappoport
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This paper presents our experiments with applying TUPA to the CoNLL 2018 UD shared task. TUPA is a general neural transition-based DAG parser, which we use to present the first experiments on recovering enhanced dependencies as part of the general parsing task. TUPA was designed for parsing UCCA, a cross-linguistic semantic annotation scheme, exhibiting reentrancy, discontinuity and non-terminal nodes. By converting UD trees and graphs to a UCCA-like DAG format, we train TUPA almost without modification on the UD parsing task. The generic nature of our approach lends itself naturally to multitask learning.

Multitask Parsing Across Semantic Representations
Daniel Hershcovich | Omri Abend | Ari Rappoport
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The ability to consolidate information of different types is at the core of intelligence, and has tremendous practical value in allowing learning for one task to benefit from generalizations learned for others. In this paper we tackle the challenging task of improving semantic parsing performance, taking UCCA parsing as a test case, and AMR, SDP and Universal Dependencies (UD) parsing as auxiliary tasks. We experiment on three languages, using a uniform transition-based system and learning architecture for all parsing tasks. Despite notable conceptual, formal and domain differences, we show that multitask learning significantly improves UCCA parsing in both in-domain and out-of-domain settings.

2017

A Transition-Based Directed Acyclic Graph Parser for UCCA
Daniel Hershcovich | Omri Abend | Ari Rappoport
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present the first parser for UCCA, a cross-linguistically applicable framework for semantic representation, which builds on extensive typological work and supports rapid annotation. UCCA poses a challenge for existing parsing techniques, as it exhibits reentrancy (resulting in DAG structures), discontinuous structures and non-terminal nodes corresponding to complex semantic units. To our knowledge, the conjunction of these formal properties is not supported by any existing parser. Our transition-based parser, which uses a novel transition set and features based on bidirectional LSTMs, has value not just for UCCA parsing: its ability to handle more general graph structures can inform the development of parsers for other semantic DAG structures, and in languages that frequently use discontinuous structures.

2015

Automatic Claim Negation: Why, How and When
Yonatan Bilu | Daniel Hershcovich | Noam Slonim
Proceedings of the 2nd Workshop on Argumentation Mining

2014

Context Dependent Claim Detection
Ran Levy | Yonatan Bilu | Daniel Hershcovich | Ehud Aharoni | Noam Slonim
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

A Benchmark Dataset for Automatic Detection of Claims and Evidence in the Context of Controversial Topics
Ehud Aharoni | Anatoly Polnarov | Tamar Lavee | Daniel Hershcovich | Ran Levy | Ruty Rinott | Dan Gutfreund | Noam Slonim
Proceedings of the First Workshop on Argumentation Mining

Co-authors

Miryam de Lhoneux 8

Jens Bjerring-Hansen 5

Laura Cabello Piqueras 5

Alexander Conroy 5

Ari Rappoport 5

Rahul Aralikatte 4

Antonia Karamolegkou 4

Stephan Oepen 4

Marcel Bollmann 3

Tim O’Gorman 3

Bolette Sandford Pedersen 3

Nathan Schneider 3

Mostafa Abdou 2

Lasha Abzianidze 2

Luciana Benotti 2

Leshem Choshen 2

Kirstine Nielsen Degn 2

Dan Gutfreund 2

Yova Kementchedjhieva 2

Marco Kuhlmann 2

Artur Kulmizev 2

Anatoly Polnarov 2

Vinodkumar Prabhakaran 2

Ekaterina Shutova 2

Zohar Aizenbud 1

Carlos Alzate 1

Ruth-Ann Armstrong 1

Isabelle Augenstein 1

Julia Bingler 1

Johannes Bjerva 1

Stephanie Brandl 1

Emanuele Bugliarello 1

Ilias Chalkidis 1

Michel DeGraff 1

Thierry Declerck 1

Lucia Donatelli 1

Abee Eijansantos 1

Desmond Elliott 1

Marcell Fekete 1

Constanza Fierro 1

Nicolas Garneau 1

Ana Valeria González 1

Morgan Grobol 1

Marcus Hansen 1

Mareike Hartmann 1

Hans Erik Heje 1

Diptesh Kanojia 1

Taelin Karidi 1

Mitesh M. Khapra 1

Alexander Koller 1

David Konopnicki 1

Michael Kranzlein 1

Mathias Kraus 1

Ernests Lavrinovics 1

Markus Leippold 1

Carsten Levisen 1

Nelson F. Liu 1

Maria Maistro 1

Catriona Malau 1

Katerina Margatina 1

Guy Moshkowich 1

Héctor Murrieta Bello 1

Héctor Ricardo Murrieta Bello 1

Roberto Navigli 1

Lukas Nielsen 1

Malvina Nikandrou 1

Benjamin Paddags 1

Georgios Pantazopoulos 1

Jakob Ingemann Parby 1

Ellie Pavlick 1

Esther Ploeger 1

Michael Ringaard 1

Paul Röttger 1

Danae Sanchez Villegas 1

Anders Sandholm 1

Valkyrie Savage 1

Nina Schneidermann 1

Steven Schockaert 1

Nikolaus Schrack 1

Rico Sennrich 1

Benjamin Sznajder 1

Timothy R Tangherlini 1

Kushal Tatariya 1

Simone Tedeschi 1

Zdenka Uresova 1

Supriti Vijay 1

Qian Janice Wang 1

Nicolas Webersinke 1

Srishti Yadav 1

Crystina Zhang 1

Naama Zwerdling 1

Venues