Burcu Can

2025

A Morpheme-Aware Child-Inspired Language Model
Necva Bölücü | Burcu Can
Proceedings of the First BabyLM Workshop

Most tokenization methods in language models rely on subword units that lack explicit linguistic correspondence. In this work, we investigate the impact of using morpheme-based tokens in a small language model, comparing them to the widely used frequency-based method, BPE. We apply the morpheme-based tokenization method to both 10-million and 100-million word datasets from the BabyLM Challenge. Our results show that using a morphological tokenizer improves EWoK (basic world knowledge) performance by around 20% and entity tracking by around 40%, highlighting the impact of morphological information in developing smaller language models. We also apply curriculum learning, in which morphological information is gradually introduced during training, mirroring the vocabulary-building stage in infants that precedes morphological processing. The results are consistent with previous research: curriculum learning yields slight improvements for some tasks, but performance degradation in others.

pdf bib

Proceedings of the First International Workshop on Gaze Data and Natural Language Processing
Cengiz Acarturk | Jamal Nasir | Burcu Can | Cagrı Coltekin
Proceedings of the First International Workshop on Gaze Data and Natural Language Processing

2024

pdf bib abs

Error Analysis of NLP Models and Non-Native Speakers of English Identifying Sarcasm in Reddit Comments
Oliver Cakebread-Andrews | Le An Ha | Ingo Frommholz | Burcu Can
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper summarises the differences and similarities found between humans and three natural language processing models when attempting to identify whether English online comments are sarcastic or not. Three models were used to analyse 300 comments from the FigLang 2020 Reddit Dataset, with and without context. The same 300 comments were also given to 39 non-native speakers of English and the results were compared. The aim was to find whether there were any results that could be applied to English as a Foreign Language (EFL) teaching. The results showed that there were similarities between the models and non-native speakers, in particular the logistic regression model. They also highlighted weaknesses with both non-native speakers and the models in detecting sarcasm when the comments included political topics or were phrased as questions. This has potential implications for how the EFL teaching industry could implement the results of error analysis of NLP models in teaching practices.

pdf bib abs

Forged-GAN-BERT: Authorship Attribution for LLM-Generated Forged Novels
Kanishka Silva | Ingo Frommholz | Burcu Can | Fred Blain | Raheem Sarwar | Laura Ugolini
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

The advancement of generative Large Language Models (LLMs), capable of producing human-like texts, introduces challenges related to the authenticity of the text documents. This requires exploring potential forgery scenarios within the context of authorship attribution, especially in the literary domain. Particularly,two aspects of doubted authorship may arise in novels, as a novel may be imposed by a renowned author or include a copied writing style of a well-known novel. To address these concerns, we introduce Forged-GAN-BERT, a modified GANBERT-based model to improve the classification of forged novels in two data-augmentation aspects: via the Forged Novels Generator (i.e., ChatGPT) and the generator in GAN. Compared to other transformer-based models, the proposed Forged-GAN-BERT model demonstrates an improved performance with F1 scores of 0.97 and 0.71 for identifying forged novels in single-author and multi-author classification settings. Additionally, we explore different prompt categories for generating the forged novels to analyse the quality of the generated texts using different similarity distance measures, including ROUGE-1, Jaccard Similarity, Overlap Confident, and Cosine Similarity.

2023

pdf bib abs

Authorship Attribution of Late 19th Century Novels using GAN-BERT
Kanishka Silva | Burcu Can | Frédéric Blain | Raheem Sarwar | Laura Ugolini | Ruslan Mitkov
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Authorship attribution aims to identify the author of an anonymous text. The task becomes even more worthwhile when it comes to literary works. For example, pen names were commonly used by female authors in the 19th century resulting in some literary works being incorrectly attributed or claimed. With this motivation, we collated a dataset of late 19th century novels in English. Due to the imbalance in the dataset and the unavailability of enough data per author, we employed the GANBERT model along with data sampling strategies to fine-tune a transformer-based model for authorship attribution. Differently from the earlier studies on the GAN-BERT model, we conducted transfer learning on comparatively smaller author subsets to train more focused author-specific models yielding performance over 0.88 accuracy and F1 scores. Furthermore, we observed that increasing the sample size has a negative impact on the model’s performance. Our research mainly contributes to the ongoing authorship attribution research using GAN-BERT architecture, especially in attributing disputed novelists in the late 19th century.

pdf bib abs

Which Sentence Representation is More Informative: An Analysis on Text Classification
Necva Bölücü | Burcu Can
Proceedings of the Seventh International Conference on Dependency Linguistics (Depling, GURT/SyntaxFest 2023)

Text classification is a popular and well-studied problem in Natural Language Processing. Most previous work on text classification has focused on deep neural networks such as LSTMs and CNNs. However, text classification studies using syntactic and semantic information are very limited in the literature. In this study, we propose a model using Graph Attention Network (GAT) that incorporates semantic and syntactic information as input for the text classification task. The semantic representations of UCCA and AMR are used as semantic information and the dependency tree is used as syntactic information. Extensive experimental results and in-depth analysis show that UCCA-GAT model, which is a semantic-aware model outperforms the AMR-GAT and DEP-GAT, which are semantic and syntax-aware models respectively. We also provide a comprehensive analysis of the proposed model to understand the limitations of the representations for the problem.

pdf bib

2022

pdf bib abs

TurkishDelightNLP: A Neural Turkish NLP Toolkit
Huseyin Alecakir | Necva Bölücü | Burcu Can
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations

We introduce a neural Turkish NLP toolkit called TurkishDelightNLP that performs computational linguistic analyses from morphological level to semantic level that involves tasks such as stemming, morphological segmentation, morphological tagging, part-of-speech tagging, dependency parsing, and semantic parsing, as well as high-level NLP tasks such as named entity recognition. We publicly share the open-source Turkish NLP toolkit through a web interface that allows an input text to be analysed in real-time, as well as the open source implementation of the components provided in the toolkit, an API, and several annotated datasets such as word similarity test set to evaluate word embeddings and UCCA-based semantic annotation in Turkish. This will be the first open-source Turkish NLP toolkit that involves a range of NLP tasks in all levels. We believe that it will be useful for other researchers in Turkish NLP and will be also beneficial for other high-level NLP tasks in Turkish.

pdf bib abs

Analysing Syntactic and Semantic Features in Pre-trained Language Models in a Fully Unsupervised Setting
Necva Bölücü | Burcu Can
Proceedings of the 19th International Conference on Natural Language Processing (ICON)

Transformer-based pre-trained language models (PLMs) have been used in all NLP tasks and resulted in a great success. This has led to the question of whether we can transfer this knowledge to syntactic or semantic parsing in a completely unsupervised setting. In this study, we leverage PLMs as a source of external knowledge to perform a fully unsupervised parser model for semantic, constituency and dependency parsing. We analyse the results for English, German, French, and Turkish to understand the impact of the PLMs on different languages for syntactic and semantic parsing. We visualize the attention layers and heads in PLMs for parsing to understand the information that can be learned throughout the layers and the attention heads in the PLMs both for different levels of parsing tasks. The results obtained from dependency, constituency, and semantic parsing are similar to each other, and the middle layers and the ones closer to the final layers have more syntactic and semantic information.

pdf bib

Comparison of Token- and Character-Level Approaches to Restoration of Spaces, Punctuation, and Capitalization in Various Languages
Laurence Dyer | Anthony Hughes | Dhwani Shah | Burcu Can
Proceedings of the 5th International Conference on Natural Language and Speech Processing (ICNLSP 2022)

pdf bib abs

Turkish Universal Conceptual Cognitive Annotation
Necva Bölücü | Burcu Can
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Universal Conceptual Cognitive Annotation (UCCA) (Abend and Rappoport, 2013a) is a cross-lingual semantic annotation framework that provides an easy annotation without any requirement for linguistic background. UCCA-annotated datasets have been already released in English, French, and German. In this paper, we introduce the first UCCA-annotated Turkish dataset that currently involves 50 sentences obtained from the METU-Sabanci Turkish Treebank (Atalay et al., 2003; Oflazeret al., 2003). We followed a semi-automatic annotation approach, where an external semantic parser is utilised for an initial annotation of the dataset, which is partially accurate and requires refinement. We manually revised the annotations obtained from the semantic parser that are not in line with the UCCA rules that we defined for Turkish. We used the same external semantic parser for evaluation purposes and conducted experiments with both zero-shot and few-shot learning. While the parser cannot predict remote edges in zero-shot setting, using even a small subset of training data in few-shot setting increased the overall F-1 score including the remote edges. This is the initial version of the annotated dataset and we are currently extending the dataset. We will release the current Turkish UCCA annotation guideline along with the annotated dataset.

pdf bib

2021

pdf bib abs

Bilingual Terminology Extraction Using Neural Word Embeddings on Comparable Corpora
Darya Filippova | Burcu Can | Gloria Corpas Pastor
Proceedings of the Student Research Workshop Associated with RANLP 2021

Term and glossary management are vital steps of preparation of every language specialist, and they play a very important role at the stage of education of translation professionals. The growing trend of efficient time management and constant time constraints we may observe in every job sector increases the necessity of the automatic glossary compilation. Many well-performing bilingual AET systems are based on processing parallel data, however, such parallel corpora are not always available for a specific domain or a language pair. Domain-specific, bilingual access to information and its retrieval based on comparable corpora is a very promising area of research that requires a detailed analysis of both available data sources and the possible extraction techniques. This work focuses on domain-specific automatic terminology extraction from comparable corpora for the English – Russian language pair by utilizing neural word embeddings.

2020

pdf bib abs

Self Attended Stack-Pointer Networks for Learning Long Term Dependencies
Salih Tuc | Burcu Can
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

We propose a novel deep neural architecture for dependency parsing, which is built upon a Transformer Encoder (Vaswani et al. 2017) and a Stack Pointer Network (Ma et al. 2018). We first encode each sentence using a Transformer Network and then the dependency graph is generated by a Stack Pointer Network by selecting the head of each word in the sentence through a head selection process. We evaluate our model on Turkish and English treebanks. The results show that our trasformer-based model learns long term dependencies efficiently compared to sequential models such as recurrent neural networks. Our self attended stack pointer network improves UAS score around 6% upon the LSTM based stack pointer (Ma et al. 2018) for Turkish sentences with a length of more than 20 words.

2019

pdf bib

2018

pdf bib abs

Tree Structured Dirichlet Processes for Hierarchical Morphological Segmentation
Burcu Can | Suresh Manandhar
Computational Linguistics, Volume 44, Issue 2 - June 2018

This article presents a probabilistic hierarchical clustering model for morphological segmentation. In contrast to existing approaches to morphology learning, our method allows learning hierarchical organization of word morphology as a collection of tree structured paradigms. The model is fully unsupervised and based on the hierarchical Dirichlet process. Tree hierarchies are learned along with the corresponding morphological paradigms simultaneously. Our model is evaluated on Morpho Challenge and shows competitive performance when compared to state-of-the-art unsupervised morphological segmentation systems. Although we apply this model for morphological segmentation, the model itself can also be used for hierarchical clustering of other types of data.

pdf bib abs

Characters or Morphemes: How to Represent Words?
Ahmet Üstün | Murathan Kurfalı | Burcu Can
Proceedings of the Third Workshop on Representation Learning for NLP

In this paper, we investigate the effects of using subword information in representation learning. We argue that using syntactic subword units effects the quality of the word representations positively. We introduce a morpheme-based model and compare it against to word-based, character-based, and character n-gram level models. Our model takes a list of candidate segmentations of a word and learns the representation of the word based on different segmentations that are weighted by an attention mechanism. We performed experiments on Turkish as a morphologically rich language and English with a comparably poorer morphology. The results show that morpheme-based models are better at learning word representations of morphologically complex languages compared to character-based and character n-gram level models since the morphemes help to incorporate more syntactic knowledge in learning, that makes morpheme-based models better at syntactic tasks.

Burcu Can

2025

2024

2023

2022

2021

2020

2019

2018

2013

2012

Co-authors

Venues