Necva Bölücü

Also published as: Necva Bolucu


2023

Impact of Sample Selection on In-Context Learning for Entity Extraction from Scientific Writing
Necva Bölücü | Maciej Rybinski | Stephen Wan
Findings of the Association for Computational Linguistics: EMNLP 2023

Prompt-based usage of Large Language Models (LLMs) is an increasingly popular way to tackle many well-known natural language problems. This trend is due, in part, to the appeal of the In-Context Learning (ICL) prompt set-up, in which a few selected training examples are provided along with the inference request. ICL, a type of few-shot learning, is especially attractive for natural language processing (NLP) tasks defined for specialised domains, such as entity extraction from scientific documents, where annotation is very costly due to the expertise required of annotators. In this paper, we present a comprehensive analysis of in-context sample selection methods for entity extraction from scientific documents using GPT-3.5 and compare these results against a fully supervised transformer-based baseline. Our results indicate that the effectiveness of the in-context sample selection methods is heavily domain-dependent, but the improvements are more notable for problems with a larger number of entity types. More in-depth analysis shows that ICL is more effective for low-resource set-ups of scientific information extraction.
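As a minimal illustration of the ICL set-up described above (not the paper's implementation), one common sample selection strategy picks demonstrations by embedding similarity to the query sentence. The encoder model, prompt format, and helper names below are assumptions:

```python
# Illustrative sketch only: similarity-based in-context sample selection
# for few-shot entity extraction. The encoder and prompt layout are
# assumptions, not the paper's method.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical encoder choice

def select_icl_examples(query, train_texts, train_labels, k=4):
    """Pick the k training sentences most similar to the query."""
    query_emb = encoder.encode(query, convert_to_tensor=True)
    train_embs = encoder.encode(train_texts, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, train_embs)[0]  # cosine similarity per example
    top_k = scores.topk(k).indices.tolist()
    return [(train_texts[i], train_labels[i]) for i in top_k]

def build_prompt(query, examples):
    """Assemble a few-shot prompt from the selected demonstrations."""
    demos = "\n\n".join(
        f"Text: {text}\nEntities: {labels}" for text, labels in examples
    )
    return f"{demos}\n\nText: {query}\nEntities:"
```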

Which Sentence Representation is More Informative: An Analysis on Text Classification
Necva Bölücü | Burcu Can
Proceedings of the Seventh International Conference on Dependency Linguistics (Depling, GURT/SyntaxFest 2023)

Text classification is a popular and well-studied problem in Natural Language Processing. Most previous work on text classification has focused on deep neural networks such as LSTMs and CNNs, while studies that exploit syntactic and semantic information remain very limited in the literature. In this study, we propose a model using a Graph Attention Network (GAT) that incorporates semantic and syntactic information as input for the text classification task. The semantic representations of UCCA and AMR are used as semantic information, and the dependency tree is used as syntactic information. Extensive experimental results and in-depth analysis show that the semantic-aware UCCA-GAT model outperforms the AMR-GAT and DEP-GAT models, which are semantic- and syntax-aware respectively. We also provide a comprehensive analysis of the proposed model to understand the limitations of the representations for the problem.
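A minimal sketch of the graph-over-sentence idea described above, not the paper's model: words become nodes, dependency arcs become edges, and a GAT produces a pooled sentence representation for classification. The library choice (torch_geometric), dimensions, and pooling are assumptions:

```python
# Illustrative sketch only: a GAT over dependency edges for sentence
# classification. Embeddings, edge construction, and pooling are assumptions.
import torch
from torch_geometric.nn import GATConv, global_mean_pool

class DepGAT(torch.nn.Module):
    def __init__(self, in_dim, hid_dim, num_classes, heads=4):
        super().__init__()
        self.gat1 = GATConv(in_dim, hid_dim, heads=heads)
        self.gat2 = GATConv(hid_dim * heads, hid_dim, heads=1)
        self.out = torch.nn.Linear(hid_dim, num_classes)

    def forward(self, x, edge_index, batch):
        # x: word embeddings (num_words, in_dim)
        # edge_index: dependency arcs as a (2, num_edges) index tensor
        # batch: maps each word node to its sentence id for pooling
        h = torch.relu(self.gat1(x, edge_index))
        h = torch.relu(self.gat2(h, edge_index))
        return self.out(global_mean_pool(h, batch))  # logits per sentence
```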

Investigating the Impact of Syntax-Enriched Transformers on Quantity Extraction in Scientific Texts
Necva Bölücü | Maciej Rybinski | Stephen Wan
Proceedings of the Second Workshop on Information Extraction from Scientific Publications

Findings of WASSA 2023 Shared Task: Multi-Label and Multi-Class Emotion Classification on Code-Mixed Text Messages
Iqra Ameer | Necva Bölücü | Hua Xu | Ali Al Bataineh
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

We present the results of WASSA 2023 Shared Task 2: Emotion Classification on code-mixed text messages (Roman Urdu + English), which included two tracks for emotion classification: multi-label and multi-class. The participants were provided with a dataset of code-mixed SMS messages in English and Roman Urdu labeled with 12 emotions for both tracks. A total of 5 teams (19 team members) participated in the shared task. We summarize the methods, resources, and tools used by the participating teams, and we have also made the data freely available for further improvements to the task.

2022

Analysing Syntactic and Semantic Features in Pre-trained Language Models in a Fully Unsupervised Setting
Necva Bölücü | Burcu Can
Proceedings of the 19th International Conference on Natural Language Processing (ICON)

Transformer-based pre-trained language models (PLMs) have been used across NLP tasks with great success. This raises the question of whether we can transfer this knowledge to syntactic or semantic parsing in a completely unsupervised setting. In this study, we leverage PLMs as a source of external knowledge to build fully unsupervised parsers for semantic, constituency, and dependency parsing. We analyse the results for English, German, French, and Turkish to understand the impact of the PLMs on different languages for syntactic and semantic parsing. We visualise the attention layers and heads of the PLMs to understand what information can be learned throughout the layers and attention heads for the different levels of parsing. The results obtained from dependency, constituency, and semantic parsing are similar to each other: the middle layers and those closer to the final layers carry more syntactic and semantic information.
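A minimal sketch of how per-layer, per-head attention matrices can be extracted from a PLM for this kind of analysis, assuming the Hugging Face Transformers API; the model choice is an assumption and this is not the paper's code:

```python
# Illustrative sketch only: inspecting attention across layers and heads
# of a pre-trained Transformer. The model name is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple of length num_layers, each tensor shaped
# (batch, num_heads, seq_len, seq_len)
for layer_idx, att in enumerate(outputs.attentions):
    print(f"layer {layer_idx}: head-averaged attention shape",
          att.mean(dim=1).shape)
```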

Turkish Universal Conceptual Cognitive Annotation
Necva Bölücü | Burcu Can
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Universal Conceptual Cognitive Annotation (UCCA) (Abend and Rappoport, 2013a) is a cross-lingual semantic annotation framework that enables easy annotation without requiring any linguistic background. UCCA-annotated datasets have already been released in English, French, and German. In this paper, we introduce the first UCCA-annotated Turkish dataset, currently comprising 50 sentences obtained from the METU-Sabanci Turkish Treebank (Atalay et al., 2003; Oflazer et al., 2003). We followed a semi-automatic annotation approach in which an external semantic parser is used for an initial annotation of the dataset, which is partially accurate and requires refinement. We manually revised the annotations obtained from the semantic parser that were not in line with the UCCA rules we defined for Turkish. We used the same external semantic parser for evaluation purposes and conducted experiments with both zero-shot and few-shot learning. While the parser cannot predict remote edges in the zero-shot setting, using even a small subset of training data in the few-shot setting increased the overall F1 score, including on remote edges. This is the initial version of the annotated dataset, and we are currently extending it. We will release the current Turkish UCCA annotation guidelines along with the annotated dataset.

TurkishDelightNLP: A Neural Turkish NLP Toolkit
Huseyin Alecakir | Necva Bölücü | Burcu Can
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations

We introduce a neural Turkish NLP toolkit called TurkishDelightNLP that performs computational linguistic analyses from the morphological level to the semantic level, covering tasks such as stemming, morphological segmentation, morphological tagging, part-of-speech tagging, dependency parsing, and semantic parsing, as well as high-level NLP tasks such as named entity recognition. We publicly share the toolkit through a web interface that allows input text to be analysed in real time, along with the open-source implementation of its components, an API, and several annotated datasets, such as a word similarity test set for evaluating word embeddings and UCCA-based semantic annotations in Turkish. This is the first open-source Turkish NLP toolkit that covers a range of NLP tasks at all levels. We believe it will be useful for other researchers in Turkish NLP and will also be beneficial for other high-level NLP tasks in Turkish.

Automatic Classification of Evidence Based Medicine Using Transformers
Necva Bolucu | Pinar Uskaner Hepsag
Proceedings of the 20th Annual Workshop of the Australasian Language Technology Association