Cagri Coltekin

Also published as: Çağrı Çöltekin, Cagrı Coltekin

2026

An Idiom Benchmark for Turkish
Ebru Çavuşoğlu | Cagri Coltekin
Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)

Despite recent significant advances, idioms, like other forms of figurative language, present a challenge to natural language processing (NLP). Benchmark corpora are essential for improving the current models on understanding idioms. However, such corpora are only available for a limited set of languages. In this paper, we introduce our ongoing work on a benchmark corpus of Turkish idioms. Our corpus is structured for testing both idiom recognition and idiom understanding. The corpus currently consists of 200 instances with sentences including idiomatic use, their literal paraphrases, similar sentences with no entailment, and non-idiomatic use of the idiomatic expressions when possible. We describe the methodology used to create the corpus, as well as initial experiments with a selection of LLMs.

Identifying units, ‘syntactic words’, for morphosyntactic analysis is important yet challenging for morphologically rich languages. In this paper we propose a set of guiding principles to determine units of morphosyntactic analysis, and apply them to the case of copular constructions in Turkic languages, in the context of the Universal Dependencies (UD) framework. We also provide a survey of the practice in the Turkic UD treebanks published to date, and discuss the advantages and disadvantages of the proposed tokenisation for a selection of Turkic languages.

2025

Multimodal Fact-Checking with Vision Language Models: A Probing Classifier based Solution with Embedding Strategies
Recep Firat Cekinel | Pinar Karagoz | Çağrı Çöltekin
Proceedings of the 31st International Conference on Computational Linguistics

This study evaluates the effectiveness of Vision Language Models (VLMs) in representing and utilizing multimodal content for fact-checking. To be more specific, we investigate whether incorporating multimodal content improves performance compared to text-only models and how well VLMs utilize text and image information to enhance misinformation detection. Furthermore, we propose a probing classifier based solution using VLMs. Our approach extracts embeddings from the last hidden layer of selected VLMs and inputs them into a neural probing classifier for multi-class veracity classification. Through a series of experiments on two fact-checking datasets, we demonstrate that while multimodality can enhance performance, fusing separate embeddings from text and image encoders yielded superior results compared to using VLM embeddings. Furthermore, the proposed neural classifier significantly outperformed KNN and SVM baselines in leveraging extracted embeddings, highlighting its effectiveness for multimodal fact-checking.

Proceedings of the First International Workshop on Gaze Data and Natural Language Processing
Cengiz Acarturk | Jamal Nasir | Burcu Can | Cagrı Coltekin
Proceedings of the First International Workshop on Gaze Data and Natural Language Processing

Proceedings of the 22nd SIGMORPHON workshop on Computational Morphology, Phonology, and Phonetics
Garrett Nicolai | Eleanor Chodroff | Frederic Mailhot | Çağrı Çöltekin
Proceedings of the 22nd SIGMORPHON workshop on Computational Morphology, Phonology, and Phonetics

Proceedings of the Eighth Workshop on Universal Dependencies (UDW, SyntaxFest 2025)
Gosse Bouma | Çağrı Çöltekin
Proceedings of the Eighth Workshop on Universal Dependencies (UDW, SyntaxFest 2025)

Parallel Universal Dependencies Treebanks for Turkic Languages
Arofat Akhundjanova | Furkan Akkurt | Bermet Chontaeva | Soudabeh Eslami | Cagri Coltekin
Proceedings of the Eighth Workshop on Universal Dependencies (UDW, SyntaxFest 2025)

We introduce the first fully aligned and manually annotated parallel Universal Dependencies (UD) treebanks for four Turkic languages: Azerbaijani, Kyrgyz, Turkish, and Uzbek. These resources currently consist of 148 strategically selected sentences that illustrate typologically significant morphosyntactic phenomena across these related yet distinct languages. These parallel treebanks enable systematic comparative studies of Turkic syntax and may be instrumental in cross-lingual NLP applications. All treebanks are available as part of UD v2.16.

Developing a Universal Dependencies Treebank for Alaskan Gwich’in
Matthew Kirk Andrews | Cagri Coltekin
Proceedings of the Eighth Workshop on Universal Dependencies (UDW, SyntaxFest 2025)

This paper presents a Universal Dependencies (UD) treebank of Gwich’in, a severely endangered Athabascan language. The treebank, developed using instructional materials and dictionaries, includes 313 annotated sentences. This paper discusses the methodology used to construct the treebank, the linguistic challenges faced, and the implications of annotating a polysynthetic, morphologically complex language within the Universal Dependencies framework. The treebank was released with UD version 2.15 and is available at https://github.com/UniversalDependencies/UD_Gwichin-TueCL/.

2024

A Treebank of Asia Minor Greek
Eleni Vligouridou | Inessa Iliadou | Çağrı Çöltekin
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Asia Minor Greek (AMG) dialects are endangered dialects rich in history and culture that face a dire struggle for preservation due to a declining speaker base and scarce linguistic resources. To address this need, we introduce a Universal Dependencies treebank of Pharasiot Greek, one of the severely endangered AMG dialects. The present treebank is fully manually annotated and currently consists of 350 sentences from six fairy tales in the Pharasiot dialect. Besides describing the treebank and the annotation process, we provide and discuss interesting phenomena we observed in the treebank. Most phenomena we discuss are related to contact-induced linguistic changes that these dialects are well known for. Beyond linguistic inquiry, like other treebanks for truly low-resource languages, the AMG treebank we present offers potential for diverse applications, such as language preservation and revitalization, as well as NLP tools that have to be developed with scarce resources.

Cross-Lingual Learning vs. Low-Resource Fine-Tuning: A Case Study with Fact-Checking in Turkish
Recep Firat Cekinel | Pinar Karagoz | Çağrı Çöltekin
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The rapid spread of misinformation through social media platforms has raised concerns regarding its impact on public opinion. While misinformation is prevalent in other languages, the majority of research in this field has concentrated on the English language. Hence, there is a scarcity of datasets for other languages, including Turkish. To address this concern, we have introduced the FCTR dataset, consisting of 3238 real-world claims. This dataset spans multiple domains and incorporates evidence collected from three Turkish fact-checking organizations. Additionally, we aim to assess the effectiveness of cross-lingual transfer learning for low-resource languages, with a particular focus on Turkish. We demonstrate in-context learning (zero-shot and few-shot) performance of large language models in this context. The experimental results indicate that the dataset has the potential to advance research in the Turkish language.

A Universal Dependencies Treebank for Gujarati
Mayank Jobanputra | Maitrey Mehta | Çağrı Çöltekin
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024

The Universal Dependencies (UD) project has presented itself as a valuable platform to develop various resources for the languages of the world. We present and release a sample treebank for the Indo-Aryan language of Gujarati – a widely spoken language with few linguistic resources. This treebank is the first labeled dataset for dependency parsing in the language and the script (the Gujarati script). The treebank contains 187 part-of-speech and dependency annotated sentences from diverse genres. We discuss various idiosyncratic examples, annotation choices and present an elaborate corpus along with agreement statistics. We see this work as a valuable resource and a stepping stone for research in Gujarati Computational Linguistics.

As part of our efforts to develop unified Universal Dependencies (UD) guidelines for Turkic languages, we evaluate multiple approaches to a difficult morphosyntactic phenomenon, pronominal locative expressions formed by a suffix -ki. These forms result in multiple syntactic words, with potentially conflicting morphological features, and participating in different dependency relations. We describe multiple approaches to the problem in current (and upcoming) Turkic UD treebanks, and show that none of them offers a solution that satisfies a number of constraints we consider (including constraints imposed by UD guidelines). This calls for a compromise with the ‘least damage’ that should be adopted by most, if not all, Turkic treebanks. Our discussion of the phenomenon and various annotation approaches may also help treebanking efforts for other languages or language families with similar constructions.

Multilingual Power and Ideology identification in the Parliament: a reference dataset and simple baselines
Çağrı Çöltekin | Matyáš Kopp | Katja Meden | Vaidas Morkevicius | Nikola Ljubešić | Tomaž Erjavec
Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024

We introduce a dataset on political orientation and power position identification. The dataset is derived from ParlaMint, a set of comparable corpora of transcribed parliamentary speeches from 29 national and regional parliaments. We introduce the dataset, provide the reasoning behind some of the choices during its creation, present statistics on the dataset, and, using a simple classifier, some baseline results on predicting political orientation on the left-to-right axis, and on power position identification, i.e., distinguishing between the speeches delivered by governing coalition party members from those of opposition party members.

Tübingen-CL at SemEval-2024 Task 1: Ensemble Learning for Semantic Relatedness Estimation
Leixin Zhang | Çağrı Çöltekin
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

The paper introduces our system for SemEval-2024 Task 1, which aims to predict the relatedness of sentence pairs. Operating under the hypothesis that semantic relatedness is a broader concept that extends beyond mere similarity of sentences, our approach seeks to identify useful features for relatedness estimation. We employ an ensemble approach integrating various systems, including statistical textual features and outputs of deep learning models to predict relatedness scores. The findings suggest that semantic relatedness can be inferred from various sources and ensemble models outperform many individual systems in estimating semantic relatedness.

Proceedings of the 21st SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology
Garrett Nicolai | Eleanor Chodroff | Frederic Mailhot | Çağrı Çöltekin
Proceedings of the 21st SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology

NLP for Arbëresh: How an Endangered Language Learns to Write in the 21st Century
Giulio Cusenza | Çağrı Çöltekin
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024

Societies are becoming more and more connected, and minority languages often find themselves helpless against the advent of the digital age, with their speakers having to regularly turn to other languages for written communication. This work introduces the case of Arbëresh, a southern Italian language related to Albanian. It presents the very first machine-readable Arbëresh data, collected through a web campaign, and describes a set of tools developed to enable the Arbëresh people to learn how to write their language, including a spellchecker, a conjugator, a numeral generator, and an interactive platform to learn Arbëresh spelling. A comprehensive web application was set up to make these tools available to the public, as well as to collect further data through them. This method can be replicated to help revive other minority languages in a situation similar to Arbëresh’s. The main challenges of the process were the extremely low-resource setting and the variability of Arbëresh dialects.

2023

Proceedings of the 20th SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology
Garrett Nicolai | Eleanor Chodroff | Frederic Mailhot | Çağrı Çöltekin
Proceedings of the 20th SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology

This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2023. The campaign is part of the tenth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2023. Three separate shared tasks were included this year: Slot and intent detection for low-resource language varieties (SID4LR), Discriminating Between Similar Languages – True Labels (DSL-TL), and Discriminating Between Similar Languages – Speech (DSL-S). All three tasks were organized for the first time this year.

2022

CoRoSeOf - An Annotated Corpus of Romanian Sexist and Offensive Tweets
Diana Constantina Hoefels | Çağrı Çöltekin | Irina Diana Mădroane
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper introduces CoRoSeOf, a large corpus of Romanian social media manually annotated for sexist and offensive language. We describe the annotation process of the corpus, provide initial analyses, and baseline classification results for sexism detection on this data set. The resulting corpus contains 39 245 tweets, annotated by multiple annotators (with an agreement rate of Fleiss’ κ = 0.45), following the sexist label set of a recent study. The automatic sexism detection yields scores similar to some of the earlier studies (macro averaged F1 score of 83.07% on binary classification task). We release the corpus with a permissive license.

In ParlaMint I, a CLARIN ERIC supported project in pandemic times, a set of comparable and uniformly annotated multilingual corpora for 17 national parliaments was developed and released in 2021. For 2022 and 2023, the project has been extended to ParlaMint II, again with CLARIN ERIC financial support, in order to enhance the existing corpora with new data and metadata; upgrade the XML schema; add corpora for 10 new parliaments; provide more application scenarios and carry out additional experiments. The paper reports on these planned steps, including some that have already been taken, and outlines future plans.

Emotions Running High? A Synopsis of the state of Turkish Politics through the ParlaMint Corpus
Gül M. Kurtoğlu Eskişar | Çağrı Çöltekin
Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference

We present the initial results of our quantitative study on emotions (Anger, Disgust, Fear, Happiness, Sadness and Surprise) in the Turkish parliament (2011–2021). We use machine learning models to assign emotion scores to all speeches delivered in the parliament during this period, and observe any changes to them in relation to major political and social events in Turkey. We highlight a number of interesting observations, such as anger being the dominant emotion in parliamentary speeches, and the ruling party showing more stable emotions compared to the political opposition, despite its depiction as a populist party in the literature.

2021

Team ReadMe at CMCL 2021 Shared Task: Predicting Human Reading Patterns by Traditional Oculomotor Control Models and Machine Learning
Alisan Balkoca | Abdullah Algan | Cengiz Acarturk | Çağrı Çöltekin
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

This system description paper describes our participation in the CMCL 2021 shared task on predicting human reading patterns. Our focus in this study is making use of well-known, traditional oculomotor control models and machine learning systems. We present experiments with a traditional oculomotor control model (the EZ Reader) and two machine learning models (a linear regression model and a recurrent network model), as well as combining the two different models. In all experiments we test effects of features well-known in the literature for predicting reading patterns, such as frequency, word length and predictability. Our experiments support the earlier findings that such features are useful when combined. Furthermore, we show that although machine learning models perform better in comparison to traditional models, a combination of both gives a consistent improvement for predicting multiple eye tracking variables during reading.

ROFF - A Romanian Twitter Dataset for Offensive Language
Mihai Manolescu | Çağrı Çöltekin
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

This paper describes the annotation process of an offensive language data set for Romanian on social media. To facilitate comparable multi-lingual research on offensive language, the annotation guidelines follow some of the recent annotation efforts for other languages. The final corpus contains 5000 micro-blogging posts annotated by a large number of volunteer annotators. The inter-annotator agreement and the initial automatic discrimination results we present are in line with earlier annotation efforts.

2020

Reproduction and Replication: A Case Study with Automatic Essay Scoring
Eva Huber | Çağrı Çöltekin
Proceedings of the Twelfth Language Resources and Evaluation Conference

As in many experimental sciences, reproducibility of experiments has gained ever more attention in the NLP community. This paper presents our reproduction efforts of an earlier study of automatic essay scoring (AES) for determining the proficiency of second language learners in a multilingual setting. We present three sets of experiments with different objectives. First, as prescribed by the LREC 2020 REPROLANG shared task, we rerun the original AES system using the code published by the original authors on the same dataset. Second, we repeat the same experiments on the same data with a different implementation. And third, we test the original system on a different dataset and a different language. Most of our findings are in line with the findings of the original paper. Nevertheless, there are some discrepancies between our results and the results presented in the original paper. We report and discuss these differences in detail. We further go into some points related to confirmation of research findings through reproduction, including the choice of the dataset, reporting and accounting for variability, use of appropriate evaluation metrics, and making code and data available. We also discuss the varying uses and differences between the terms reproduction and replication, and we argue that reproduction, the confirmation of conclusions through independent experiments in varied settings, is more valuable than exact replication of the published values.

A Corpus of Turkish Offensive Language on Social Media
Çağrı Çöltekin
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper introduces a corpus of Turkish offensive language. To our knowledge, this is the first corpus of offensive language for Turkish. The corpus consists of randomly sampled micro-blog posts from Twitter. The annotation guidelines are based on a careful review of the annotation practices of recent efforts for other languages. The corpus contains 36 232 tweets sampled randomly from the Twitter stream during a period of 18 months between Apr 2018 and Sept 2019. We found approximately 19 % of the tweets in the data contain some type of offensive language, which is further subcategorized based on the target of the offense. We describe the annotation process, discuss some interesting aspects of the data, and present results of automatically classifying the corpus using state-of-the-art text classification methods. The classifiers achieve 77.3 % F1 score on identifying offensive tweets, 77.9 % F1 score on determining whether a given offensive document is targeted or not, and 53.0 % F1 score on classifying the targeted offensive documents into three subcategories.

We present the results and the main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval-2020). The task included three subtasks corresponding to the hierarchical taxonomy of the OLID schema from OffensEval-2019, and it was offered in five languages: Arabic, Danish, English, Greek, and Turkish. OffensEval-2020 was one of the most popular tasks at SemEval-2020, attracting a large number of participants across all subtasks and languages: a total of 528 teams signed up to participate in the task, 145 teams submitted official runs on the test data, and 70 teams submitted system description papers.

Verification, Reproduction and Replication of NLP Experiments: a Case Study on Parsing Universal Dependencies
Çağrı Çöltekin
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)

As in any field of inquiry that depends on experiments, the verifiability of experimental studies is important in computational linguistics. Despite increased attention to verification of empirical results, the practices in the field are unclear. Furthermore, we argue, certain traditions and practices that are seemingly useful for verification may in fact be counterproductive. We demonstrate this through a set of multi-lingual experiments on parsing Universal Dependencies treebanks. In particular, we show that emphasis on exact replication leads to practices (some of which are now well established) that hide the variation in experimental results, effectively hindering verifiability with a false sense of certainty. The purpose of the present paper is to highlight the magnitude of the issues resulting from these common practices with the hope of instigating further discussion. Once we, as a community, are convinced about the importance of the problems, the solutions are rather obvious, although not necessarily easy to implement.

Dialect Identification under Domain Shift: Experiments with Discriminating Romanian and Moldavian
Çağrı Çöltekin
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

This paper describes a set of experiments for discriminating between two closely related language varieties, Moldavian and Romanian, under a substantial domain shift. The experiments were conducted as part of the Romanian dialect identification task in the VarDial 2020 evaluation campaign. Our best system, based on a linear SVM classifier, obtained the first position in the shared task with an F1 score of 0.79, supporting the earlier results showing (unexpected) success of machine learning systems in this task. The additional experiments reported in this paper also show that adapting to the test set is useful when the training data comes from another domain. However, the benefit of adaptation becomes doubtful even when a small amount of data from the target domain is available.

2019

Language Discrimination and Transfer Learning for Similar Languages: Experiments with Feature Combinations and Adaptation
Nianheng Wu | Eric DeMattos | Kwok Him So | Pin-zhen Chen | Çağrı Çöltekin
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper describes the work done by team tearsofjoy participating in the VarDial 2019 Evaluation Campaign. We developed two systems based on Support Vector Machines: SVM with a flat combination of features and SVM ensembles. We participated in all language/dialect identification tasks, as well as the Moldavian vs. Romanian cross-dialect topic identification (MRC) task. Our team achieved first place in German Dialect identification (GDI) and MRC subtasks 2 and 3, second place in the simplified variant of Discriminating between Mainland and Taiwan variation of Mandarin Chinese (DMT) as well as Cuneiform Language Identification (CLI), and third and fifth place in DMT traditional and MRC subtask 1 respectively. In most cases, the SVM with a flat combination of features performed better than SVM ensembles. Besides describing the systems and the results obtained by them, we provide a tentative comparison between the feature combination methods, and present additional experiments with a method of adaptation to the test set, which may indicate potential pitfalls with some of the data sets.

Neural and Linear Pipeline Approaches to Cross-lingual Morphological Analysis
Çağrı Çöltekin | Jeremy Barnes
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper describes the Tübingen-Oslo team’s participation in the cross-lingual morphological analysis task in the VarDial 2019 evaluation campaign. We participated in the shared task with a standard neural network model. Our model achieved analysis F1-scores of 31.48 and 23.67 on test languages Karachay-Balkar (Turkic) and Sardinian (Romance) respectively. The scores are comparable to the scores obtained by the other participants in both language families, and the analysis score on the Romance data set was also the best result obtained in the shared task. Besides describing the system used in our shared task participation, we describe another, simpler, model based on linear classifiers, and present further analyses using both models. Our analyses, besides revealing some of the difficult cases, also confirm that the usefulness of a source language in this task is highly correlated with the similarity of source and target languages.

Cross-lingual morphological inflection with explicit alignment
Çağrı Çöltekin
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology

This paper describes two related systems for cross-lingual morphological inflection for SIGMORPHON 2019 Shared Task participation. Both sets of results submitted to the shared task for evaluation are obtained using a simple approach of predicting transducer actions based on initial alignments on the training set, where cross-lingual transfer is limited to only using the high-resource language data as an additional training set. The performance of the system does not reach the performance of the top two systems in the competition. However, we show that results can be improved with further tuning. We also present further analyses demonstrating that the cross-lingual gain is rather modest.

Challenges of Annotating a Code-Switching Treebank
Özlem Çetinoğlu | Çağrı Çöltekin
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)

2018

Tübingen-Oslo system at SIGMORPHON shared task on morphological inflection. A multi-tasking multilingual sequence to sequence model.
Taraka Rama | Çağrı Çöltekin
Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

Tübingen-Oslo at SemEval-2018 Task 2: SVMs perform better than RNNs in Emoji Prediction
Çağrı Çöltekin | Taraka Rama
Proceedings of the 12th International Workshop on Semantic Evaluation

This paper describes our participation in the SemEval-2018 task Multilingual Emoji Prediction. We participated in both English and Spanish subtasks, experimenting with support vector machines (SVMs) and recurrent neural networks. Our SVM classifier obtained the top rank in both subtasks with macro-averaged F1-measures of 35.99% for English and 22.36% for Spanish data sets. Similar to a few earlier attempts, the results with neural networks were not on par with linear SVMs.

Tübingen-Oslo Team at the VarDial 2018 Evaluation Campaign: An Analysis of N-gram Features in Language Variety Identification
Çağrı Çöltekin | Taraka Rama | Verena Blaschke
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

This paper describes our systems for the VarDial 2018 evaluation campaign. We participated in all language identification tasks, namely, Arabic dialect identification (ADI), German dialect identification (GDI), discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). In all of the tasks, we only used textual transcripts (not using audio features for ADI). We submitted system runs based on support vector machine classifiers (SVMs) with bag of character and word n-grams as features, and gated bidirectional recurrent neural networks (RNNs) using units of characters and words. Our SVM models outperformed our RNN models in all tasks, obtaining the first place on the DFS task, third place on the ADI task, and second place on others according to the official rankings. As well as describing the models we used in the shared task participation, we present an analysis of the n-gram features used by the SVM models in each task, and also report additional results (that were run after the official competition deadline) on the GDI surprise dialect track.

Phonetic Vector Representations for Sound Sequence Alignment
Pavel Sofroniev | Çağrı Çöltekin
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology

This study explores a number of data-driven vector representations of the IPA-encoded sound segments for the purpose of sound sequence alignment. We test the alternative representations based on the alignment accuracy in the context of computational historical linguistics. We show that the data-driven methods consistently do better than linguistically-motivated articulatory-acoustic features. The similarity scores obtained using the data-driven representations in a monolingual context, however, perform worse than the state-of-the-art distance (or similarity) scoring methods proposed in earlier studies of computational historical linguistics. We also show that adapting representations to the task at hand improves the results, yielding alignment accuracy comparable to the state-of-the-art methods.

Identifying Depression on Reddit: The Effect of Training Data
Inna Pirina | Çağrı Çöltekin
Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task

This paper presents a set of classification experiments for identifying depression in posts gathered from social media platforms. In addition to the data gathered previously by other researchers, we collect additional data from the social media platform Reddit. Our experiments show promising results for identifying depression from social media texts. More importantly, however, we show that the choice of corpora is crucial in identifying depression, and that a poor choice of data can lead to misleading conclusions.

pdf bib abs

Drug-Use Identification from Tweets with Word and Character N-Grams
Çağrı Çöltekin | Taraka Rama
Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task

This paper describes our systems for the social media mining for health applications (SMM4H) shared task. We participated in all four tracks of the shared task using linear models with a combination of character and word n-gram features. We did not use any external data or domain-specific information. The resulting systems achieved above-average scores among the participating systems, with F1-scores of 91.22, 46.8, 42.4, and 85.53 on tasks 1, 2, 3, and 4, respectively.

pdf bib abs

We evaluate corpus-based measures of linguistic complexity obtained using Universal Dependencies (UD) treebanks. We propose a method of estimating robustness of the complexity values obtained using a given measure and a given treebank. The results indicate that measures of syntactic complexity might be on average less robust than those of morphological complexity. We also estimate the validity of complexity measures by comparing the results for very similar languages and checking for unexpected differences. We show that some of those differences that arise can be diminished by using parallel treebanks and, more importantly from the practical point of view, by harmonizing the language-specific solutions in the UD annotation.

2017

pdf bib abs

The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.

pdf bib

Converting the TüBa-D/Z Treebank of German to Universal Dependencies
Çağrı Çöltekin | Ben Campbell | Erhard Hinrichs | Heike Telljohann
Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017)

pdf bib abs

Computational analysis of Gondi dialects
Taraka Rama | Çağrı Çöltekin | Pavel Sofroniev
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This paper presents a computational analysis of Gondi dialects spoken in central India. We present a digitized data set of the dialect area, and analyze the data using different techniques from dialectometry, deep learning, and computational biology. We show that the methods largely agree with each other and with the earlier non-computational analyses of the language group.

pdf bib abs

Tübingen system in VarDial 2017 shared task: experiments with language identification and cross-lingual parsing
Çağrı Çöltekin | Taraka Rama
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This paper describes our systems and results on VarDial 2017 shared tasks. Besides three language/dialect discrimination tasks, we also participated in the cross-lingual dependency parsing (CLP) task using a simple methodology which we also briefly describe in this paper. For all the discrimination tasks, we used linear SVMs with character and word features. The system achieves competitive results among other systems in the shared task. We also report additional experiments with neural network models. The performance of neural network models was close but always below the corresponding SVM classifiers in the discrimination tasks. For the cross-lingual parsing task, we experimented with an approach based on automatically translating the source treebank to the target language, and training a parser on the translated treebank. We used off-the-shelf tools for both translation and parsing. Despite achieving better-than-baseline results, our scores in CLP tasks were substantially lower than the scores of the other participants.

pdf bib abs

Fewer features perform well at Native Language Identification task
Taraka Rama | Çağrı Çöltekin
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

This paper describes our results at the NLI shared task 2017. We participated in the essay, speech, and fusion tracks, which use text, speech, and i-vectors for the task of identifying the native language of the given input. In the essay track, a linear SVM system using word bigrams and character 7-grams performed the best. In the speech track, an LDA classifier based only on i-vectors performed better than a combination system using text features from speech transcriptions and i-vectors. In the fusion track, we experimented with systems that used combinations of i-vectors with higher-order n-gram features, combinations of i-vectors with word unigrams, a mean probability ensemble, and a stacked ensemble system. Our finding is that word unigrams in combination with i-vectors achieve a higher score than systems trained with a larger number of n-gram features. Our best-performing systems achieved F1-scores of 87.16%, 83.33%, and 91.75% on the essay track, the speech track, and the fusion track, respectively.

2016

pdf bib abs

The Universal Dependencies (UD) project was conceived after the substantial recent interest in unifying annotation schemes across languages. With its own annotation principles and abstract inventory for parts of speech, morphosyntactic features and dependency relations, UD aims to facilitate multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. This paper presents the Turkish IMST-UD Treebank, the first Turkish treebank to be in a UD release. The IMST-UD Treebank was automatically converted from the IMST Treebank, which was also recently released. We describe this conversion procedure in detail, complete with mapping tables. We also present our evaluation of the parsing performances of both versions of the IMST Treebank. Our findings suggest that the UD framework is at least as viable for Turkish as the original annotation framework of the IMST Treebank.

pdf bib

Part of Speech Annotation of a Turkish-German Code-Switching Corpus
Özlem Çetinoğlu | Çağrı Çöltekin
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

pdf bib

Learning Phone Embeddings for Word Segmentation of Child-Directed Speech
Jianqiang Ma | Çağrı Çöltekin | Erhard Hinrichs
Proceedings of the 7th Workshop on Cognitive Aspects of Computational Language Learning

pdf bib abs

Discriminating Similar Languages with Linear SVMs and Neural Networks
Çağrı Çöltekin | Taraka Rama
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

This paper describes the systems we experimented with for participating in the discriminating between similar languages (DSL) shared task 2016. We submitted results of a single system based on support vector machines (SVM) with a linear kernel and character n-gram features, which obtained the first rank in the closed training track for test set A. Besides the linear SVM, we also report additional experiments with a number of deep learning architectures. Despite our intuition that non-linear deep learning methods should be advantageous, linear models seem to fare better in this task, at least with the amount of data and the amount of effort we spent on tuning these models.
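
As a rough illustration of the character n-gram features mentioned in the abstract above, the following sketch counts overlapping character n-grams in a string. It is not the authors' implementation: the function name `char_ngrams`, the n-gram range, and the use of raw counts (rather than any particular weighting) are assumptions made for the example, and the resulting counts would still need to be fed to a classifier such as a linear SVM.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count overlapping character n-grams of lengths n_min..n_max.

    Hypothetical helper for illustration; the n-gram range and the use of
    raw counts are assumptions, not settings taken from the paper.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Unigrams and bigrams of a short string.
features = char_ngrams("dil", 1, 2)
```

Each document becomes a sparse bag of such counts, so texts in closely related languages can be compared through the sub-word patterns they share or lack.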

pdf bib abs

LSTM Autoencoders for Dialect Analysis
Taraka Rama | Çağrı Çöltekin
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

Computational approaches to dialectometry have employed Levenshtein distance to compute an aggregate similarity between two dialects belonging to a single language group. In this paper, we apply a sequence-to-sequence autoencoder to learn a deep representation for words that can be used for meaningful comparison across dialects. In contrast to the alignment-based methods, our method does not require explicit alignments. We apply our architectures to three different datasets and show that the learned representations yield results highly similar to the analyses based on Levenshtein distance and capture the traditional dialectal differences shown by dialectologists.
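
The Levenshtein-based baseline that the abstract above contrasts with can be sketched as follows. This is a minimal illustration, not the procedure used in the paper: the helper names, the length normalisation, and the assumption that the two word lists are already paired concept by concept are all choices made for the example.

```python
def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def aggregate_distance(words_a, words_b):
    """Mean length-normalised edit distance over paired word forms.

    Hypothetical aggregate: assumes words_a[i] and words_b[i] express the
    same concept in two dialects.
    """
    pairs = list(zip(words_a, words_b))
    if not pairs:
        raise ValueError("need at least one word pair")
    total = sum(levenshtein(x, y) / max(len(x), len(y), 1) for x, y in pairs)
    return total / len(pairs)
```

Computing this value for every pair of dialect sites gives a distance matrix that can then be clustered or projected, which is the kind of aggregate analysis the learned representations are compared against.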

2015

pdf bib

Units in segmentation: a computational investigation
Çağrı Çöltekin
Proceedings of the Sixth Workshop on Cognitive Aspects of Computational Language Learning

2014

pdf bib abs

A set of open source tools for Turkish natural language processing
Çağrı Çöltekin
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper introduces a set of freely available, open-source tools for Turkish that are built around TRmorph, a morphological analyzer introduced earlier in Coltekin (2010). The article first provides an update on the analyzer, which includes a complete rewrite using a different finite-state description language and tool set, as well as major tagset changes to comply better with the state of the art in computational processing of Turkish and with the user requests received so far. Besides these major changes to the analyzer, this paper introduces tools for morphological segmentation, stemming and lemmatization, guessing unknown words, grapheme-to-phoneme conversion, hyphenation, and morphological disambiguation.

pdf bib

An explicit statistical model of learning lexical segmentation using multiple cues
Çağrı Çöltekin | John Nerbonne
Proceedings of the 5th Workshop on Cognitive Aspects of Computational Language Learning (CogACLL)

2012

pdf bib

Detecting Shibboleths
Jelena Prokić | Çağrı Çöltekin | John Nerbonne
Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH

2010

pdf bib abs

A Freely Available Morphological Analyzer for Turkish
Çağrı Çöltekin
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper presents TRmorph, a two-level morphological analyzer for Turkish. TRmorph is a fairly complete and accurate morphological analyzer for Turkish. However, the strength of TRmorph lies neither in its performance nor in its novelty. The main feature of this analyzer is its availability. It has been implemented entirely using freely available tools and resources, and the two-level description is also distributed with a license that allows others to use and modify it freely for different applications. To our knowledge, TRmorph is the first freely available morphological analyzer for Turkish. This makes TRmorph particularly suitable for applications where the analyzer has to be changed in some way, or as a starting point for morphological analyzers for similar languages. TRmorph's specification of Turkish morphology is relatively complete, and it is distributed with a large lexicon. Along with the description of how the analyzer is implemented, this paper provides an evaluation of the analyzer on two large corpora.