2024
One Thousand and One Pairs: A “novel” challenge for long-context language models
Marzena Karpinska | Katherine Thai | Kyle Lo | Tanya Goyal | Mohit Iyyer
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Synthetic long-context LLM benchmarks (e.g., “needle-in-the-haystack”) test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and reason over information across book-length inputs? We address this question by creating NoCha, a dataset of 1,001 minimally different pairs of true and false claims about 67 recently-published English fictional books, written by human readers of those books. In contrast to existing long-context benchmarks, our annotators confirm that the largest share of pairs in NoCha requires global reasoning over the entire book to verify. Our experiments show that while human readers easily perform this task, it is enormously challenging for all ten long-context LLMs that we evaluate: no open-weight model performs above random chance (despite their strong performance on synthetic benchmarks), while GPT-4o achieves the highest pair accuracy at 55.8%. Further analysis reveals that (1) on average, models perform much better on pairs that require only sentence-level retrieval vs. global reasoning; (2) model-generated explanations for their decisions are often inaccurate even for correctly-labeled claims; and (3) models perform substantially worse on speculative fiction books that contain extensive world-building. The methodology proposed in NoCha allows for the evolution of the benchmark dataset and the easy analysis of future models.
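The pair accuracy reported above credits a model only when it labels both claims of a minimal pair correctly. A minimal sketch of that scoring scheme, assuming a simple data layout and a verify callback; this is an illustration, not the released NoCha code:

```python
# Illustrative sketch of pair accuracy over minimally different claim pairs.
# Assumes each pair holds one true and one false claim about the same book;
# a pair counts as correct only if BOTH claims are judged correctly.
from dataclasses import dataclass
from typing import Callable


@dataclass
class ClaimPair:
    book_id: str
    true_claim: str   # claim consistent with the book
    false_claim: str  # minimally edited claim contradicted by the book


def pair_accuracy(pairs: list[ClaimPair],
                  verify: Callable[[str, str], bool]) -> float:
    """verify(book_id, claim) stands in for the model's True/False judgment."""
    correct = sum(
        verify(p.book_id, p.true_claim) is True
        and verify(p.book_id, p.false_claim) is False
        for p in pairs
    )
    return correct / len(pairs)
```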
Findings of the WMT24 General Machine Translation Shared Task: The LLM Era Is Here but MT Is Not Solved Yet
Tom Kocmi | Eleftherios Avramidis | Rachel Bawden | Ondřej Bojar | Anton Dvorkovich | Christian Federmann | Mark Fishel | Markus Freitag | Thamme Gowda | Roman Grundkiewicz | Barry Haddow | Marzena Karpinska | Philipp Koehn | Benjamin Marie | Christof Monz | Kenton Murray | Masaaki Nagata | Martin Popel | Maja Popović | Mariya Shmatova | Steinthór Steingrímsson | Vilém Zouhar
Proceedings of the Ninth Conference on Machine Translation
This overview paper presents the results of the General Machine Translation Task organised as part of the 2024 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting of three to five different domains. In addition to participating systems, we collected translations from 8 different large language models (LLMs) and 4 online translation providers. We evaluate system outputs with professional human annotators using a new protocol called Error Span Annotations (ESA).
Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation
Tom Kocmi | Vilém Zouhar | Eleftherios Avramidis | Roman Grundkiewicz | Marzena Karpinska | Maja Popović | Mrinmaya Sachan | Mariya Shmatova
Proceedings of the Ninth Conference on Machine Translation
High-quality Machine Translation (MT) evaluation relies heavily on human judgments. Comprehensive error classification methods, such as Multidimensional Quality Metrics (MQM), are expensive as they are time-consuming and can only be done by experts, whose availability may be limited, especially for low-resource languages. On the other hand, just assigning overall scores, like Direct Assessment (DA), is simpler and faster and can be done by translators of any level, but is less reliable. In this paper, we introduce Error Span Annotation (ESA), a human evaluation protocol which combines the continuous rating of DA with the high-level error severity span marking of MQM. We validate ESA by comparing it to MQM and DA for 12 MT systems and one human reference translation (English to German) from WMT23. The results show that ESA offers faster and cheaper annotations than MQM at the same quality level, without the requirement of expensive MQM experts.
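To make the combination concrete, here is a hedged sketch of what a single ESA annotation record could look like: MQM-style severity-marked error spans plus a DA-style 0-100 segment score. The field names and value ranges are illustrative assumptions, not the protocol's official schema:

```python
# Illustrative data structure for one Error Span Annotation (ESA) item.
# Combines MQM-style error spans with severity labels and a DA-style scalar
# segment score; names and ranges are assumptions for illustration only.
from dataclasses import dataclass, field


@dataclass
class ErrorSpan:
    start: int       # character offset in the translation
    end: int
    severity: str    # e.g. "minor" or "major"


@dataclass
class ESAAnnotation:
    source: str
    translation: str
    error_spans: list[ErrorSpan] = field(default_factory=list)
    overall_score: int = 0   # DA-style continuous rating, 0 (worst) to 100 (best)
```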
NarrativeTime: Dense Temporal Annotation on a Timeline
Anna Rogers | Marzena Karpinska | Ankita Gupta | Vladislav Lialin | Gregory Smelkov | Anna Rumshisky
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
For the past decade, temporal annotation has been sparse: only a small portion of event pairs in a text was annotated. We present NarrativeTime, the first timeline-based annotation framework that achieves full coverage of all possible TLINKs. To compare with the previous SOTA in dense temporal annotation, we perform full re-annotation of the classic TimeBankDense corpus (American English), which shows comparable agreement with a significant increase in density. We contribute the TimeBankNT corpus (with each text fully annotated by two expert annotators), extensive annotation guidelines, open-source tools for annotation and conversion to TimeML format, and baseline results.
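To illustrate what full TLINK coverage via a timeline means: once every event has a position on the timeline, a relation between any two events can be read off their positions, yielding n(n-1)/2 links for n events. The sketch below is an idealized illustration of that idea with a simplified relation set, not the NarrativeTime annotation tools:

```python
# Idealized sketch: placing events on a timeline induces a temporal relation
# (TLINK) for every pair of events. The relation set here is deliberately
# simplified (BEFORE / AFTER / OVERLAP); not the NarrativeTime toolchain.
from itertools import combinations


def derive_tlinks(events: dict[str, tuple[float, float]]) -> dict[tuple[str, str], str]:
    """events maps an event id to its (start, end) position on the timeline."""
    tlinks = {}
    for (e1, (s1, t1)), (e2, (s2, t2)) in combinations(events.items(), 2):
        if t1 <= s2:
            rel = "BEFORE"
        elif t2 <= s1:
            rel = "AFTER"
        else:
            rel = "OVERLAP"
        tlinks[(e1, e2)] = rel
    return tlinks  # n*(n-1)/2 links for n events, i.e. full coverage
```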
2023
Program Chairs’ Report on Peer Review at ACL 2023
Anna Rogers | Marzena Karpinska | Jordan Boyd-Graber | Naoaki Okazaki
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We present a summary of the efforts to improve conference peer review that were implemented at ACL’23. This includes work with the goal of improving review quality, clearer workflow and decision support for the area chairs, as well as our efforts to improve paper-reviewer matching for various kinds of non-mainstream NLP work and to improve the overall incentives for all participants of the peer review process. We present an analysis of the factors affecting peer review, identify the most problematic issues that the authors complained about, and provide suggestions for future chairs. We hope that publishing such reports will (a) improve transparency in decision-making, (b) help people new to the field understand how the *ACL conferences work, (c) provide useful data for future chairs, workshop organizers, and academic work on peer review, and (d) provide useful context for the final program, as a source of information for meta-research on the structure and trajectory of the field of NLP.
ezCoref: Towards Unifying Annotation Guidelines for Coreference Resolution
Ankita Gupta | Marzena Karpinska | Wenlong Zhao | Kalpesh Krishna | Jack Merullo | Luke Yeh | Mohit Iyyer | Brendan O’Connor
Findings of the Association for Computational Linguistics: EACL 2023
Large-scale, high-quality corpora are critical for advancing research in coreference resolution. However, existing datasets vary in their definition of coreference and have been collected via complex and lengthy guidelines that are curated for linguistic experts. These concerns have sparked a growing interest among researchers to curate a unified set of guidelines suitable for annotators with various backgrounds. In this work, we develop a crowdsourcing-friendly coreference annotation methodology, ezCoref, consisting of an annotation tool and an interactive tutorial. We use ezCoref to re-annotate 240 passages from seven existing English coreference datasets (spanning fiction, news, and multiple other domains) while teaching annotators only cases that are treated similarly across these datasets. Surprisingly, we find that reasonable-quality annotations were already achievable (90% agreement between the crowd and expert annotations) even without extensive training. On carefully analyzing the remaining disagreements, we identify linguistic cases that our annotators unanimously agree upon but that lack a unified treatment in existing datasets (e.g., generic pronouns, appositives). We propose that the research community revisit these phenomena when curating future unified annotation guidelines.
Large Language Models Effectively Leverage Document-level Context for Literary Translation, but Critical Errors Persist
Marzena Karpinska | Mohit Iyyer
Proceedings of the Eighth Conference on Machine Translation
Large language models (LLMs) are competitive with the state of the art on a wide range of sentence-level translation datasets. However, their ability to translate paragraphs and documents remains unexplored because evaluation in these settings is costly and difficult. We show through a rigorous human evaluation that asking the GPT-3.5 (text-davinci-003) LLM to translate an entire literary paragraph (e.g., from a novel) at once results in higher-quality translations than standard sentence-by-sentence translation across 18 linguistically diverse language pairs (e.g., translating into and out of Japanese, Polish, and English). Our evaluation, which took approximately 350 hours of effort for annotation and analysis, was conducted by hiring translators fluent in both the source and target language and asking them to provide both span-level error annotations and preference judgments of which system’s translations are better. We observe that discourse-level LLM translators commit fewer mistranslations, grammar errors, and stylistic inconsistencies than sentence-level approaches. With that said, critical errors still abound, including occasional content omissions, and a human translator’s intervention remains necessary to ensure that the author’s voice remains intact. We publicly release our dataset and error annotations to spur future research on the evaluation of document-level literary translation.
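A minimal sketch contrasting the two conditions compared above, translating a whole paragraph in one call versus sentence by sentence. The prompt wording and the complete callback are illustrative assumptions, not the exact prompts or client code used in the study:

```python
# Illustrative contrast between paragraph-level and sentence-level prompting.
# `complete` stands in for any LLM call that maps a prompt string to a
# completion; the prompt templates are assumptions for illustration.
from typing import Callable


def translate_paragraph(paragraph: str, src: str, tgt: str,
                        complete: Callable[[str], str]) -> str:
    prompt = (f"Translate the following {src} paragraph into {tgt}, "
              f"preserving style and discourse-level coherence:\n\n{paragraph}")
    return complete(prompt)


def translate_sentence_by_sentence(sentences: list[str], src: str, tgt: str,
                                   complete: Callable[[str], str]) -> str:
    outputs = []
    for sent in sentences:
        prompt = f"Translate this {src} sentence into {tgt}:\n\n{sent}"
        outputs.append(complete(prompt))
    return " ".join(outputs)  # context across sentences is lost between calls
```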
2022
DEMETR: Diagnosing Evaluation Metrics for Translation
Marzena Karpinska | Nishant Raj | Katherine Thai | Yixiao Song | Ankita Gupta | Mohit Iyyer
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
While machine translation evaluation metrics based on string overlap (e.g., BLEU) have their limitations, their computations are transparent: the BLEU score assigned to a particular candidate translation can be traced back to the presence or absence of certain words. The operations of newer learned metrics (e.g., BLEURT, COMET), which leverage pretrained language models to achieve higher correlations with human quality judgments than BLEU, are opaque in comparison. In this paper, we shed light on the behavior of these learned metrics by creating DEMETR, a diagnostic dataset with 31K English examples (translated from 10 source languages) for evaluating the sensitivity of MT evaluation metrics to 35 different linguistic perturbations spanning semantic, syntactic, and morphological error categories. All perturbations were carefully designed to form minimal pairs with the actual translation (i.e., differ in only one aspect). We find that learned metrics perform substantially better than string-based metrics on DEMETR. Additionally, learned metrics differ in their sensitivity to various phenomena (e.g., BERTScore is sensitive to untranslated words but relatively insensitive to gender manipulation, while COMET is much more sensitive to word repetition than to aspectual changes). We publicly release DEMETR to spur more informed future development of machine translation evaluation metrics.
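One simple reading of a metric's sensitivity to a perturbation, in the minimal-pair spirit described above, is whether the metric scores the perturbed translation lower than the original one. The sketch below illustrates that idea under assumed field names; it is not the released DEMETR evaluation code:

```python
# Illustrative minimal-pair sensitivity check for an MT metric.
# `score(src, ref, hyp)` stands in for any metric (BLEU, BLEURT, COMET, ...);
# a sensitive metric should rank the perturbed hypothesis below the original.
from typing import Callable


def perturbation_sensitivity(examples: list[dict],
                             score: Callable[[str, str, str], float]) -> float:
    """Fraction of minimal pairs where the metric penalizes the perturbation.

    Each example is assumed to carry 'src', 'ref', 'mt' (actual translation),
    and 'perturbed_mt' (same translation with one injected error).
    """
    hits = sum(
        score(ex["src"], ex["ref"], ex["mt"])
        > score(ex["src"], ex["ref"], ex["perturbed_mt"])
        for ex in examples
    )
    return hits / len(examples)
```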
Exploring Document-Level Literary Machine Translation with Parallel Paragraphs from World Literature
Katherine Thai | Marzena Karpinska | Kalpesh Krishna | Bill Ray | Moira Inghilleri | John Wieting | Mohit Iyyer
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Literary translation is a culturally significant task, but it is bottlenecked by the small number of qualified literary translators relative to the many untranslated works published around the world. Machine translation (MT) holds potential to complement the work of human translators by improving both training procedures and their overall efficiency. Literary translation is less constrained than more traditional MT settings since translators must balance meaning equivalence, readability, and critical interpretability in the target language. This property, along with the complex discourse-level context present in literary texts, also makes literary MT more challenging to computationally model and evaluate. To explore this task, we collect a dataset (Par3) of non-English language novels in the public domain, each aligned at the paragraph level to both human and automatic English translations. Using Par3, we discover that expert literary translators prefer reference human translations over machine-translated paragraphs at a rate of 84%, while state-of-the-art automatic MT metrics do not correlate with those preferences. The experts note that MT outputs contain not only mistranslations, but also discourse-disrupting errors and stylistic inconsistencies. To address these problems, we train a post-editing model whose output is preferred over normal MT output at a rate of 69% by experts. We publicly release Par3 to spur future research into literary MT.
Revisiting Statistical Laws of Semantic Shift in Romance Cognates
Yoshifumi Kawasaki | Maëlys Salingre | Marzena Karpinska | Hiroya Takamura | Ryo Nagata
Proceedings of the 29th International Conference on Computational Linguistics
This article revisits statistical relationships across Romance cognates between lexical semantic shift and six intra-linguistic variables, such as frequency and polysemy. Cognates are words that are derived from a common etymon, in this case, a Latin ancestor. Despite their shared etymology, some cognate pairs have experienced semantic shift. The degree of semantic shift is quantified using cosine distance between the cognates’ corresponding word embeddings. In the previous literature, frequency and polysemy have been reported to be correlated with semantic shift; however, the understanding of their effects needs revision because of various methodological defects. In the present study, we perform regression analysis under improved experimental conditions, and demonstrate a genuine negative effect of frequency and positive effect of polysemy on semantic shift. Furthermore, we reveal that morphologically complex etyma are more resistant to semantic shift and that the cognates that have been in use over a longer timespan are prone to greater shift in meaning. These findings add to our understanding of the historical process of semantic change.
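The degree of semantic shift described above is the cosine distance between the cognates' embedding vectors. A minimal sketch of that quantity, with the embeddings themselves assumed to be given (e.g., from aligned monolingual vector spaces):

```python
# Minimal sketch: semantic shift between two cognates quantified as the
# cosine distance (1 - cosine similarity) between their word embeddings.
# The embeddings are assumed to come from comparable / aligned spaces.
import numpy as np


def semantic_shift(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    cos_sim = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
    return 1.0 - float(cos_sim)  # 0 = no shift in direction, larger = greater shift

# Hypothetical usage, assuming per-language embedding lookups:
# shift = semantic_shift(emb_spanish["salir"], emb_french["saillir"])
```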
2021
The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation
Marzena Karpinska | Nader Akoury | Mohit Iyyer
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Recent text generation research has increasingly focused on open-ended domains such as story and poetry generation. Because models built for such tasks are difficult to evaluate automatically, most researchers in the space justify their modeling choices by collecting crowdsourced human judgments of text quality (e.g., Likert scores of coherence or grammaticality) from Amazon Mechanical Turk (AMT). In this paper, we first conduct a survey of 45 open-ended text generation papers and find that the vast majority of them fail to report crucial details about their AMT tasks, hindering reproducibility. We then run a series of story evaluation experiments with both AMT workers and English teachers and discover that even with strict qualification filters, AMT workers (unlike teachers) fail to distinguish between model-generated text and human-generated references. We show that AMT worker judgments improve when they are shown model-generated output alongside human-generated references, which enables the workers to better calibrate their ratings. Finally, interviews with the English teachers provide deeper insights into the challenges of the evaluation process, particularly when rating model-generated text.
2018
Subcharacter Information in Japanese Embeddings: When Is It Worth It?
Marzena Karpinska | Bofang Li | Anna Rogers | Aleksandr Drozd
Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP
Languages with logographic writing systems present a difficulty for traditional character-level models. Leveraging the subcharacter information was recently shown to be beneficial for a number of intrinsic and extrinsic tasks in Chinese. We examine whether the same strategies could be applied for Japanese, and contribute a new analogy dataset for this language.
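For context, analogy datasets of this kind are commonly evaluated with the vector-offset method ("a is to b as c is to ?"). The sketch below shows that standard procedure; it is not necessarily the exact evaluation used in the paper:

```python
# Standard vector-offset analogy solving (a : b :: c : ?) over a vocabulary of
# embeddings; shown for illustration, not the paper's exact evaluation code.
import numpy as np


def solve_analogy(a: str, b: str, c: str,
                  vocab_vectors: dict[str, np.ndarray]) -> str:
    target = vocab_vectors[b] - vocab_vectors[a] + vocab_vectors[c]
    target = target / np.linalg.norm(target)
    best_word, best_sim = None, -1.0
    for word, vec in vocab_vectors.items():
        if word in (a, b, c):       # exclude the query words themselves
            continue
        sim = float(np.dot(vec, target) / np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word
```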