Murathan Kurfalı

Also published as: Murathan Kurfali

2025

pdf bib abs
LLM-based post-editing as reference-free GEC evaluation
Robert Östling | Murathan Kurfali | Andrew Caines
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

Evaluation of Grammatical Error Correction (GEC) systems is becoming increasingly challenging as the quality of such systems increases and traditional automatic metrics fail to adequately capture such nuances as fluency versus minimal edits, alternative valid corrections compared to the ‘ground truth’, and the difference between corrections that are useful in a language learning scenario versus those preferred by native readers. Previous work has suggested using human post-editing of GEC system outputs, but this is very labor-intensive. We investigate the use of Large Language Models (LLMs) as post-editors of English and Swedish texts, and perform a meta-analysis of a range of different evaluation setups using a set of recent GEC systems. We find that for the two languages studied in our work, automatic evaluation based on post-editing agrees well with both human post-editing and direct human rating of GEC systems. Furthermore, we find that a simple n-gram overlap metric is sufficient to measure post-editing distance, and that including human references when prompting the LLMs generally does not improve agreement with human ratings. The resulting evaluation metric is reference-free and requires no language-specific training or additional resources beyond an LLM capable of handling the given language.Evaluation of Grammatical Error Correction (GEC) systems is becoming increasingly challenging as the quality of such systems increases and traditional automatic metrics fail to adequately capture such nuances as fluency versus minimal edits, alternative valid corrections compared to the ‘ground truth’, and the difference between corrections that are useful in a language learning scenario versus those preferred by native readers. Previous work has suggested using human post-editing of GEC system outputs, but this is very labor-intensive. We investigate the use of Large Language Models (LLMs) as post-editors of English and Swedish texts, and perform a meta-analysis of a range of different evaluation setups using a set of recent GEC systems. We find that for the two languages studied in our work, automatic evaluation based on post-editing agrees well with both human post-editing and direct human rating of GEC systems. Furthermore, we find that a simple n-gram overlap metric is sufficient to measure post-editing distance, and that including human references when prompting the LLMs generally does not improve agreement with human ratings. The resulting evaluation metric is reference-free and requires no language-specific training or additional resources beyond an LLM capable of handling the given language.

pdf bib abs
ClimateEval: A Comprehensive Benchmark for NLP Tasks Related to Climate Change
Murathan Kurfali | Shorouq Zahra | Joakim Nivre | Gabriele Messori
Proceedings of the 2nd Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2025)

ClimateEval is a comprehensive benchmark designed to evaluate natural language processing models across a broad range of tasks related to climate change. ClimateEval aggregates existing datasets along with a newly developed news classification dataset, created specifically for this release. This results in a benchmark of 25 tasks based on 13 datasets, covering key aspects of climate discourse, including text classification, question answering, and information extraction. Our benchmark provides a standardized evaluation suite for systematically assessing the performance of large language models (LLMs) on these tasks. Additionally, we conduct an extensive evaluation of open-source LLMs (ranging from 2B to 70B parameters) in both zero-shot and few-shot settings, analyzing their strengths and limitations in the domain of climate change.

pdf bib abs
Conflicting Needles in a Haystack: How LLMs behave when faced with contradictory information
Murathan Kurfali | Robert Östling
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) have demonstrated an impressive ability to retrieve and summarize complex information, but their reliability in conflicting contexts remains poorly understood. We introduce an adversarial extension of the Needle-in-a-Haystack framework in which three mutually exclusive “needles” are embedded within long documents. By systematically manipulating factors such as position, repetition, layout, and domain relevance, we evaluate how LLMs handle contradictions. We find that models almost always fail to signal uncertainty and instead confidently select a single answer, exhibiting strong and consistent biases toward repetition, recency, and particular surface forms. We further analyze whether these patterns persist across model families and sizes, and we evaluate both probability-based and generation-based retrieval. Our framework highlights critical limitations in the robustness of current LLMs—including commercial systems—to contradiction. These limitations reveal potential shortcomings in RAG systems’ ability to handle noisy or manipulated inputs and exposes risks for deployment in high-stakes applications.

pdf bib
The MultiGEC-2025 Shared Task on Multilingual Grammatical Error Correction at NLP4CALL
Arianna Masciolini | Andrew Caines | Orphée De Clercq | Joni Kruijsbergen | Murathan Kurfalı | Ricardo Muñoz Sánchez | Elena Volodina | Robert Östling
Proceedings of the 14th Workshop on Natural Language Processing for Computer Assisted Language Learning

pdf bib abs
SweSAT-1.0: The Swedish University Entrance Exam as a Benchmark for Large Language Models
Murathan Kurfalı | Shorouq Zahra | Evangelia Gogoulou | Luise Dürlich | Fredrik Carlsson | Joakim Nivre
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

This introduces SweSAT-1.0, a new benchmark dataset created from the Swedish university entrance exam (Högskoleprovet) to assess large language models in Swedish. The current version of the benchmark includes 867 questions across six different tasks, including reading comprehension, mathematical problem solving, and logical reasoning. We find that some widely used open-source and commercial models excel in verbal tasks, but we also see that all models, even the commercial ones, struggle with reasoning tasks in Swedish. We hope that SweSAT-1.0 will facilitate research on large language models for Swedish by enriching the breadth of available tasks, offering a challenging evaluation benchmark that is free from any translation biases.

2024

To better understand how extreme climate events impact society, we need to increase the availability of accurate and comprehensive information about these impacts. We propose a method for building large-scale databases of climate extreme impacts from online textual sources, using LLMs for information extraction in combination with more traditional NLP techniques to improve accuracy and consistency. We evaluate the method against a small benchmark database created by human experts and find that extraction accuracy varies for different types of information. We compare three different LLMs and find that, while the commercial GPT-4 model gives the best performance overall, the open-source models Mistral and Mixtral are competitive for some types of information.

pdf bib abs
Lightweight Connective Detection Using Gradient Boosting
Mustafa Erolcan Er | Murathan Kurfalı | Deniz Zeyrek
Proceedings of the 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation @ LREC-COLING 2024

In this work, we introduce a lightweight discourse connective detection system. Employing gradient boosting trained on straightforward, low-complexity features, this proposed approach sidesteps the computational demands of the current approaches that rely on deep neural networks. Considering its simplicity, our approach achieves competitive results while offering significant gains in terms of time even on CPU. Furthermore, the stable performance across two unrelated languages suggests the robustness of our system in the multilingual scenario. The model is designed to support the annotation of discourse relations, particularly in scenarios with limited resources, while minimizing performance loss.

pdf bib abs
Evaluation of Really Good Grammatical Error Correction
Robert Östling | Katarina Gillholm | Murathan Kurfalı | Marie Mattson | Mats Wirén
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Traditional evaluation methods for Grammatical Error Correction (GEC) fail to fully capture the full range of system capabilities and objectives. The emergence of large language models (LLMs) has further highlighted the shortcomings of these evaluation strategies, emphasizing the need for a paradigm shift in evaluation methodology. In the current study, we perform a comprehensive evaluation of various GEC systems using a recently published dataset of Swedish learner texts. The evaluation is performed using established evaluation metrics as well as human judges. We find that GPT-3 in a few-shot setting by far outperforms previous grammatical error correction systems for Swedish, a language comprising only about 0.1% of its training data. We also found that current evaluation methods contain undesirable biases that a human evaluation is able to reveal. We suggest using human post-editing of GEC system outputs to analyze the amount of change required to reach native-level human performance on the task, and provide a dataset annotated with human post-edits and assessments of grammaticality, fluency and meaning preservation of GEC system outputs.

2023

pdf bib abs
Language Embeddings Sometimes Contain Typological Generalizations
Robert Östling | Murathan Kurfalı
Computational Linguistics, Volume 49, Issue 4 - December 2023

To what extent can neural network models learn generalizations about language structure, and how do we find out what they have learned? We explore these questions by training neural models for a range of natural language processing tasks on a massively multilingual dataset of Bible translations in 1,295 languages. The learned language representations are then compared to existing typological databases as well as to a novel set of quantitative syntactic and morphological features obtained through annotation projection. We conclude that some generalizations are surprisingly close to traditional features from linguistic typology, but that most of our models, as well as those of previous work, do not appear to have made linguistically meaningful generalizations. Careful attention to details in the evaluation turns out to be essential to avoid false positives. Furthermore, to encourage continued work in this field, we release several resources covering most or all of the languages in our data: (1) multiple sets of language representations, (2) multilingual word embeddings, (3) projected and predicted syntactic and morphological features, (4) software to provide linguistically sound evaluations of language representations.

pdf bib
A distantly supervised Grammatical Error Detection/Correction system for Swedish
Murathan Kurfalı | Robert Östling
Proceedings of the 12th Workshop on NLP for Computer Assisted Language Learning

2021

pdf bib abs
Probing Multilingual Language Models for Discourse
Murathan Kurfalı | Robert Östling
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)

Pre-trained multilingual language models have become an important building block in multilingual Natural Language Processing. In the present paper, we investigate a range of such models to find out how well they transfer discourse-level knowledge across languages. This is done with a systematic evaluation on a broader set of discourse-level tasks than has been previously been assembled. We find that the XLM-RoBERTa family of models consistently show the best performance, by simultaneously being good monolingual models and degrading relatively little in a zero-shot setting. Our results also indicate that model distillation may hurt the ability of cross-lingual transfer of sentence representations, while language dissimilarity at most has a modest effect. We hope that our test suite, covering 5 tasks with a total of 22 languages in 10 distinct families, will serve as a useful evaluation platform for multilingual performance at and beyond the sentence level.

pdf bib abs
Let’s be explicit about that: Distant supervision for implicit discourse relation classification via connective prediction
Murathan Kurfalı | Robert Östling
Proceedings of the 1st Workshop on Understanding Implicit and Underspecified Language

In implicit discourse relation classification, we want to predict the relation between adjacent sentences in the absence of any overt discourse connectives. This is challenging even for humans, leading to shortage of annotated data, a fact that makes the task even more difficult for supervised machine learning approaches. In the current study, we perform implicit discourse relation classification without relying on any labeled implicit relation. We sidestep the lack of data through explicitation of implicit relations to reduce the task to two sub-problems: language modeling and explicit discourse relation classification, a much easier problem. Our experimental results show that this method can even marginally outperform the state-of-the-art, in spite of being much simpler than alternative models of comparable performance. Moreover, we show that the achieved performance is robust across domains as suggested by the zero-shot experiments on a completely different domain. This indicates that recent advances in language modeling have made language models sufficiently good at capturing inter-sentence relations without the help of explicit discourse markers.

2020

pdf bib abs
TED-MDB Lexicons: Tr-EnConnLex, Pt-EnConnLex
Murathan Kurfalı | Sibel Ozer | Deniz Zeyrek | Amália Mendes
Proceedings of the First Workshop on Computational Approaches to Discourse

In this work, we present two new bilingual discourse connective lexicons, namely, for Turkish-English and European Portuguese-English created automatically using the existing discourse relation-aligned TED-MDB corpus. In their current form, the Pt-En lexicon includes 95 entries, whereas the Tr-En lexicon contains 133 entries. The lexicons constitute the first step of a larger project of developing a multilingual discourse connective lexicon.

pdf bib abs
Zero-shot cross-lingual identification of direct speech using distant supervision
Murathan Kurfalı | Mats Wirén
Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Prose fiction typically consists of passages alternating between the narrator’s telling of the story and the characters’ direct speech in that story. Detecting direct speech is crucial for the downstream analysis of narrative structure, and may seem easy at first thanks to quotation marks. However, typographical conventions vary across languages, and as a result, almost all approaches to this problem have been monolingual. In contrast, the aim of this paper is to provide a multilingual method for identifying direct speech. To this end, we created a training corpus by using a set of heuristics to automatically find texts where quotation marks appear sufficiently consistently. We then removed the quotation marks and developed a sequence classifier based on multilingual-BERT which classifies each token as belonging to narration or speech. Crucially, by training the classifier with the quotation marks removed, it was forced to learn the linguistic characteristics of direct speech rather than the typography of quotation marks. The results in the zero-shot setting of the proposed model are comparable to the strong supervised baselines, indicating that this is a feasible approach.

pdf bib abs
A Sentiment-annotated Dataset of English Causal Connectives
Marta Andersson | Murathan Kurfalı | Robert Östling
Proceedings of the 14th Linguistic Annotation Workshop

This paper investigates the semantic prosody of three causal connectives: due to, owing to and because of in seven varieties of the English language. While research in the domain of English causality exists, we are not aware of studies that would cover the domain of causal connectives in English. Our claim is that connectives such as because of link two arguments, (at least) one of which will include a phrase that contributes to the interpretation of the relation as positive or negative, and hence define the prosody of the connective used. As our results demonstrate, the majority of the prosodies identified are negative for all three connectives; the proportions are stable across the varieties of English studied, and contrary to our expectations, we find no significant differences between the functions of the connectives and discourse preferences. Further, we investigate whether automatizing the sentiment annotation procedure via a simple language-model based classifier is possible. The initial results highlights the complexity of the task and the need for complicated systems, probably aided with other related datasets to achieve reasonable performance.

pdf bib abs
A Multi-word Expression Dataset for Swedish
Murathan Kurfalı | Robert Östling | Johan Sjons | Mats Wirén
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present a new set of 96 Swedish multi-word expressions annotated with degree of (non-)compositionality. In contrast to most previous compositionality datasets we also consider syntactically complex constructions and publish a formal specification of each expression. This allows evaluation of computational models beyond word bigrams, which have so far been the norm. Finally, we use the annotations to evaluate a system for automatic compositionality estimation based on distributional semantics. Our analysis of the disagreements between human annotators and the distributional model reveal interesting questions related to the perception of compositionality, and should be informative to future work in the area.

pdf bib abs
Disambiguation of Potentially Idiomatic Expressions with Contextual Embeddings
Murathan Kurfalı | Robert Östling
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

The majority of multiword expressions can be interpreted as figuratively or literally in different contexts which pose challenges in a number of downstream tasks. Most previous work deals with this ambiguity following the observation that MWEs with different usages occur in distinctly different contexts. Following this insight, we explore the usefulness of contextual embeddings by means of both supervised and unsupervised classification. The results show that in the supervised setting, the state-of-the-art can be substantially improved for all expressions in the experiments. The unsupervised classification, similarly, yields very impressive results, comparing favorably to the supervised classifier for the majority of the expressions. We also show that multilingual contextual embeddings can also be employed for this task without leading to any significant loss in performance; hence, the proposed methodology has the potential to be extended to a number of languages.

pdf bib abs
TRAVIS at PARSEME Shared Task 2020: How good is (m)BERT at seeing the unseen?
Murathan Kurfalı
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

This paper describes the TRAVIS system built for the PARSEME Shared Task 2020 on semi-supervised identification of verbal multiword expressions. TRAVIS is a fully feature-independent model, relying only on the contextual embeddings. We have participated with two variants of TRAVIS, TRAVIS-multi and TRAVIS-mono, where the former employs multilingual contextual embeddings and the latter uses monolingual ones. Our systems are ranked second and third among seven submissions in the open track, respectively. Despite the strong performance of multilingual contextual embeddings across all languages, the results show that language-specific contextual embeddings have better generalization capabilities.

2019

pdf bib abs
Noisy Parallel Corpus Filtering through Projected Word Embeddings
Murathan Kurfalı | Robert Östling
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

We present a very simple method for parallel text cleaning of low-resource languages, based on projection of word embeddings trained on large monolingual corpora in high-resource languages. In spite of its simplicity, we approach the strong baseline system in the downstream machine translation evaluation.

pdf bib abs
Zero-shot transfer for implicit discourse relation classification
Murathan Kurfalı | Robert Östling
Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue

Automatically classifying the relation between sentences in a discourse is a challenging task, in particular when there is no overt expression of the relation. It becomes even more challenging by the fact that annotated training data exists only for a small number of languages, such as English and Chinese. We present a new system using zero-shot transfer learning for implicit discourse relation classification, where the only resource used for the target language is unannotated parallel text. This system is evaluated on the discourse-annotated TED-MDB parallel corpus, where it obtains good results for all seven languages using only English training data.

2018

pdf bib
Multilingual Extension of PDTB-Style Annotation: The Case of TED Multilingual Discourse Bank
Deniz Zeyrek | Amália Mendes | Murathan Kurfalı
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
An Assessment of Explicit Inter- and Intra-sentential Discourse Connectives in Turkish Discourse Bank
Deniz Zeyrek | Murathan Kurfalı
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib abs
Characters or Morphemes: How to Represent Words?
Ahmet Üstün | Murathan Kurfalı | Burcu Can
Proceedings of the Third Workshop on Representation Learning for NLP

In this paper, we investigate the effects of using subword information in representation learning. We argue that using syntactic subword units effects the quality of the word representations positively. We introduce a morpheme-based model and compare it against to word-based, character-based, and character n-gram level models. Our model takes a list of candidate segmentations of a word and learns the representation of the word based on different segmentations that are weighted by an attention mechanism. We performed experiments on Turkish as a morphologically rich language and English with a comparably poorer morphology. The results show that morpheme-based models are better at learning word representations of morphologically complex languages compared to character-based and character n-gram level models since the morphemes help to incorporate more syntactic knowledge in learning, that makes morpheme-based models better at syntactic tasks.

2017

pdf bib abs
TDB 1.1: Extensions on Turkish Discourse Bank
Deniz Zeyrek | Murathan Kurfalı
Proceedings of the 11th Linguistic Annotation Workshop

This paper presents the recent developments on Turkish Discourse Bank (TDB). First, the resource is summarized and an evaluation is presented. Then, TDB 1.1, i.e. enrichments on 10% of the corpus are described (namely, senses for explicit discourse connectives, and new annotations for three discourse relation types - implicit relations, entity relations and alternative lexicalizations). The method of annotation is explained and the data are evaluated.

2016

pdf bib abs
A Turkish Database for Psycholinguistic Studies Based on Frequency, Age of Acquisition, and Imageability
Elif Ahsen Acar | Deniz Zeyrek | Murathan Kurfalı | Cem Bozşahin
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This study primarily aims to build a Turkish psycholinguistic database including three variables: word frequency, age of acquisition (AoA), and imageability, where AoA and imageability information are limited to nouns. We used a corpus-based approach to obtain information about the AoA variable. We built two corpora: a child literature corpus (CLC) including 535 books written for 3-12 years old children, and a corpus of transcribed children’s speech (CSC) at ages 1;4-4;8. A comparison between the word frequencies of CLC and CSC gave positive correlation results, suggesting the usability of the CLC to extract AoA information. We assumed that frequent words of the CLC would correspond to early acquired words whereas frequent words of a corpus of adult language would correspond to late acquired words. To validate AoA results from our corpus-based approach, a rated AoA questionnaire was conducted on adults. Imageability values were collected via a different questionnaire conducted on adults. We conclude that it is possible to deduce AoA information for high frequency words with the corpus-based approach. The results about low frequency words were inconclusive, which is attributed to the fact that corpus-based AoA information is affected by the strong negative correlation between corpus frequency and rated AoA.

Co-authors

Venues

nlp4call2

repl4nlp2

bea1

cl1

isa1

wmt1