Aviya Maimon


2024

Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP
Omer Goldman | Alon Jacovi | Aviv Slobodkin | Aviya Maimon | Ido Dagan | Reut Tsarfaty
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Improvements in language models’ capabilities have pushed their applications towards longer contexts, making long-context evaluation and development an active research area. However, many disparate use cases are grouped together under the umbrella term of “long-context”, defined simply by the total length of the model’s input, including, for example, Needle-in-a-Haystack tasks, book summarization, and information aggregation. Given their varied difficulty, in this position paper we argue that conflating different tasks by their context length is unproductive. As a community, we require a more precise vocabulary to understand what makes long-context tasks similar or different. We propose to unpack the taxonomy of long-context tasks based on the properties that make them more difficult with longer contexts. We propose two orthogonal axes of difficulty: (I) Diffusion: how hard is it to find the necessary information in the context? (II) Scope: how much necessary information is there to find? We survey the long-context literature, provide justification for this taxonomy as an informative descriptor, and situate the literature with respect to it. We conclude that the most difficult and interesting settings, in which the necessary information is very long and highly diffused within the input, are severely under-explored. By using a descriptive vocabulary and discussing the relevant properties of difficulty in long context, we can conduct more informed research in this area. We call for careful design of tasks and benchmarks with distinctly long contexts, taking into account the characteristics that make them qualitatively different from shorter contexts.

2023

COHESENTIA: A Novel Benchmark of Incremental versus Holistic Assessment of Coherence in Generated Texts
Aviya Maimon | Reut Tsarfaty
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Coherence is a linguistic term that refers to the relations between small textual units (sentences, propositions) that make a text logically consistent and meaningful to the reader. With the advances of generative foundation models in NLP, there is a pressing need to automatically assess the human-perceived coherence of automatically generated texts. Until now, little work has been done on explicitly assessing the coherence of generated texts and analyzing the factors contributing to (in)coherence. Previous work on the topic used other tasks, e.g., sentence reordering, as proxies for coherence, rather than approaching coherence detection head-on. In this paper, we introduce CoheSentia, a novel benchmark of human-perceived coherence of automatically generated texts. Our annotation protocol reflects two perspectives: a global one, which assigns a single coherence score, and an incremental one, which scores the text sentence by sentence. The incremental method produces an (in)coherence score for each text fragment and also pinpoints the reasons for incoherence at that point. Our benchmark contains 500 automatically generated, human-annotated paragraphs, each annotated with both methods by multiple raters. Our analysis shows that inter-annotator agreement in the incremental mode is higher than in the holistic alternative, and our experiments show that standard LMs fine-tuned for coherence detection exhibit varied performance on the different factors contributing to (in)coherence. All in all, these models yield unsatisfactory performance, emphasizing the need for more reliable methods of coherence assessment.