Ilknur Durgar El-Kahlout

Also published as: Ilknur Durgar El-Kahlout, İlknur Durgar El-Kahlout, İlknur Durgar El-Kahlout

2021

TR-SEQ: Named Entity Recognition Dataset for Turkish Search Engine Queries
Berkay Topçu | İlknur Durgar El-Kahlout
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Recognizing named entities in short search engine queries is a difficult task due to their weaker contextual information compared to long sentences. Standard named entity recognition (NER) systems that are trained on grammatically correct and long sentences fail to perform well on such queries. In this study, we share our efforts towards creating a cleaned and labeled dataset of real Turkish search engine queries (TR-SEQ) and introduce an extended label set to satisfy the search engine needs. A NER system is trained by applying the state-of-the-art deep learning method BERT to the collected data and its high performance on search engine queries is reported. Moreover, we compare our results with the state-of-the-art Turkish NER systems.

2019

pdf bib abs

Translating Between Morphologically Rich Languages: An Arabic-to-Turkish Machine Translation System
İlknur Durgar El-Kahlout | Emre Bektaş | Naime Şeyma Erdem | Hamza Kaya
Proceedings of the Fourth Arabic Natural Language Processing Workshop

This paper introduces the work on building a machine translation system for Arabic-to-Turkish in the news domain. Our work includes collecting parallel datasets in several ways for a new and low-resourced language pair, building baseline systems with state-of-the-art architectures and developing language specific algorithms for better translation. Parallel datasets are mainly collected three different ways; i) translating Arabic texts into Turkish by professional translators, ii) exploiting the web for open-source Arabic-Turkish parallel texts, iii) using back-translation. We per-formed preliminary experiments for Arabic-to-Turkish machine translation with neural(Marian) machine translation tools with a novel morphologically motivated vocabulary reduction method.

This paper describes the TU ̈ B ̇ITAK Turkish-English submissions in both directions for the IWSLT’13 Evaluation Campaign TED Machine Translation (MT) track. We develop both phrase-based and hierarchical phrase-based statistical machine translation (SMT) systems based on Turkish wordand morpheme-level representations. We augment training data with content words extracted from itself and experiment with reverse word order for source languages. For the Turkish-to-English direction, we use Gigaword corpus as an additional language model with the training data. For the English-to-Turkish direction, we implemented a wide coverage Turkish word generator to generate words from the stem and morpheme sequences. Finally, we perform system combination of the different systems produced with different word alignments.

2012

pdf bib abs

The TÜBİTAK statistical machine translation system for IWSLT 2012
Coşkun Mermer | Hamza Kaya | İlknur Durgar El-Kahlout | Mehmet Uğur Doğan
Proceedings of the 9th International Workshop on Spoken Language Translation: Evaluation Campaign

WedescribetheTU ̈B ̇ITAKsubmissiontotheIWSLT2012 Evaluation Campaign. Our system development focused on utilizing Bayesian alignment methods such as variational Bayes and Gibbs sampling in addition to the standard GIZA++ alignments. The submitted tracks are the Arabic-English and Turkish-English TED Talks translation tasks.

pdf bib abs

Turkish Paraphrase Corpus
Seniz Demir | İlknur Durgar El-Kahlout | Erdem Unal | Hamza Kaya
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Paraphrases are alternative syntactic forms in the same language expressing the same semantic content. Speakers of all languages are inherently familiar with paraphrases at different levels of granularity (lexical, phrasal, and sentential). For quite some time, the concept of paraphrasing is getting a growing attention by the research community and its potential use in several natural language processing applications (such as text summarization and machine translation) is being investigated. In this paper, we present, what is to our best knowledge, the first Turkish paraphrase corpus. The corpus is gleaned from four different sources and currently contains 1270 paraphrase pairs. All paraphrase pairs are carefully annotated by native Turkish speakers with the identified semantic correspondences between paraphrases. The work for expanding the corpus is still under way.

2010

pdf bib

LIMSI’s Statistical Translation Systems for WMT’10
Alexandre Allauzen | Josep M. Crego | İlknur Durgar El-Kahlout | François Yvon
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf bib abs

The pay-offs of preprocessing for German-English statistical machine translation
Ilknur Durgar El-Kahlout | Francois Yvon
Proceedings of the 7th International Workshop on Spoken Language Translation: Papers

In this paper, we present the result of our work on improving the preprocessing for German-English statistical machine translation. We implemented and tested various improvements aimed at i) converting German texts to the new orthographic conventions; ii) performing a new tokenization for German; iii) normalizing lexical redundancy with the help of POS tagging and morphological analysis; iv) splitting German compound words with frequency based algorithm and; v) reducing singletons and out-of-vocabulary words. All these steps are performed during preprocessing on the German side. Combining all these processes, we reduced 10% of the singletons, 2% OOV words, and obtained 1.5 absolute (7% relative) BLEU improvement on the WMT 2010 German to English News translation task.

pdf bib abs

This paper describes LIMSI’s Statistical Machine Translation systems (SMT) for the IWSLT evaluation, where we participated in two tasks (Talk for English to French and BTEC for Turkish to English). For the Talk task, we studied an extension of our in-house n-code SMT system (the integration of a bilingual reordering model over generalized translation units), as well as the use of training data extracted from Wikipedia in order to adapt the target language model. For the BTEC task, we concentrated on pre-processing schemes on the Turkish side in order to reduce the morphological discrepancies with the English side. We also evaluated the use of two different continuous space language models for such a small size of training data.

2008

pdf bib abs

BLEU+: a Tool for Fine-Grained BLEU Computation
A. Cüneyd Tantuǧ | Kemal Oflazer | Ilknur Durgar El-Kahlout
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present a tool, BLEU+, which implements various extension to BLEU computation to allow for a better understanding of the translation performance, especially for morphologically complex languages. BLEU+ takes into account both closeness in morphological structure, closeness of the root words in the WordNet hierarchy while comparing tokens in the candidate and reference sentence. In addition to gauging performance at a finer level of granularity, BLEU+ also allows the computation of various upper bound oracle scores: comparing all tokens considering only the roots allows us to get an upper bound when all errors due to morphological structure are fixed, while comparing tokens in an error-tolerant way considering minor morpheme edit operations, allows us to get a (more realistic) upper bound when tokens that differ in morpheme insertions/deletions and substitutions are fixed. We use BLEU+ in the fine-grained evaluation of the output of our English-to-Turkish statistical MT system.