Oğuz Kerem Yıldız


2023

A CCGbank for Turkish: From Dependency to CCG
Aslı Kuzgun | Oğuz Kerem Yıldız | Olcay Taner Yıldız
Proceedings of the 12th Global Wordnet Conference

In this paper, we present the construction of a CCGbank for Turkish from standardised dependency corpora. We automatically induce Combinatory Categorial Grammar (CCG) categories for each word token in the Turkish dependency corpora. The CCG induction algorithm we present is based on the dependency relations defined in the latest release of the Universal Dependencies (UD) framework. We aim for an algorithm that can easily be applied to all Turkish treebanks annotated in this framework. Therefore, we employ a lexicalist approach in order to make full use of the dependency relations while creating a semantically transparent corpus. We present the treebanks employed in this study as well as their annotation framework. We then introduce the structure of the algorithm, along with the specific issues on which it differs from previous studies. Lastly, we show how the results obtained with this lexicalist approach differ from those of previous CCGbank studies on Turkish.

2021

From Constituency to UD-Style Dependency: Building the First Conversion Tool of Turkish
Aslı Kuzgun | Oğuz Kerem Yıldız | Neslihan Cesur | Büşra Marşan | Arife Betül Yenice | Ezgi Sanıyar | Oguzhan Kuyrukçu | Bilge Nas Arıcan | Olcay Taner Yıldız
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

This paper describes the process of building the first constituency-to-dependency conversion tool for Turkish. The starting point of this work is a previous study in which 10,000 phrase structure trees from the original Penn Treebank corpus were manually translated into Turkish. Within the scope of this project, these Turkish phrase structure trees were automatically converted into UD-style dependency structures, using both a rule-based algorithm and a machine learning algorithm tailored to the requirements of the Turkish language. The results of both algorithms were compared, and the machine learning approach proved to be more accurate than the rule-based algorithm. The output was revised by a team of linguists, and the refined versions were taken as gold-standard annotations for the evaluation of the algorithms. In addition to contributing a large dataset of 10,000 Turkish dependency trees to the UD Project, this project fills an important gap by providing a Turkish conversion tool, enabling the quick compilation of dependency corpora that can be used to train better dependency parsers.