Arzucan Özgür

Also published as: Arzucan Ozgur

2025

HATECAT-TR: A Hate Speech Span Detection and Categorization Dataset for Turkish
Hasan Kerem Şeker | Gökçe Uludoğan | Pelin Önal | Arzucan Özgür
Findings of the Association for Computational Linguistics: EMNLP 2025

Hate speech on social media in Turkey remains a critical issue, frequently targeting minority groups. Effective moderation requires not only detecting hateful posts but also identifying the specific hateful expressions within them. To address this, we introduce HATECAT-TR, a span-annotated dataset of Turkish tweets, containing 4465 hateful spans across 2981 posts, each directed at one of eight minority groups. Annotations were created using a semi-automated approach, combining GPT-4o-generated spans with human expert review to ensure accuracy. Each hateful span is categorized into one of five discourse types, enabling a fine-grained analysis of the nature and intent behind hateful content. We frame span detection as binary and multi-class token classification tasks and utilize the state-of-the-art language models to establish a baseline performance for the new dataset. Our findings highlight the challenges of detecting and categorizing implicit hate speech, particularly when spans are subtle and highly contextual. The source code is available at github.com/boun-tabi/hatecat-tr and HATECAT-TR can be shared by complying with the terms of X upon contacting the authors.

2024

pdf bib abs

Evaluating the Quality of a Corpus Annotation Scheme Using Pretrained Language Models
Furkan Akkurt | Onur Gungor | Büşra Marşan | Tunga Gungor | Balkiz Ozturk Basaran | Arzucan Özgür | Susan Uskudarli
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Pretrained language models and large language models are increasingly used to assist in a great variety of natural language tasks. In this work, we explore their use in evaluating the quality of alternative corpus annotation schemes. For this purpose, we analyze two alternative annotations of the Turkish BOUN treebank, versions 2.8 and 2.11, in the Universal Dependencies framework using large language models. Using a suitable prompt generated using treebank annotations, large language models are used to recover the surface forms of sentences. Based on the idea that the large language models capture the characteristics of the languages, we expect that the better annotation scheme would yield the sentences with higher success. The experiments conducted on a subset of the treebank show that the new annotation scheme (2.11) results in a successful recovery percentage of about 2 points higher. All the code developed for this work is available at https://github.com/boun-tabi/eval-ud .

pdf bib abs

Detecting Hate Speech in Turkish Print Media: A Corpus and A Hybrid Approach with Target-oriented Linguistic Knowledge
Gökçe Uludoğan | Atıf Emre Yüksel | Ümit Tunçer | Burak Işık | Yasemin Korkmaz | Didar Akar | Arzucan Özgür
Proceedings of the 7th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2024)

The use of hate speech targeting ethnicity, nationalities, religious identities, and specific groups has been on the rise in the news media. However, most existing automatic hate speech detection models focus on identifying hate speech, often neglecting the target group-specific language that is common in news articles. To address this problem, we first compile a hate speech dataset, TurkishHatePrintCorpus, derived from Turkish news articles and annotate it specifically for the language related to the targeted group. We then introduce the HateTargetBERT model, which integrates the target-centric linguistic features extracted in this study into the BERT model, and demonstrate its effectiveness in detecting hate speech while allowing the model’s classification decision to be explained. We have made the dataset and source code publicly available at url{https://github.com/boun-tabi/HateTargetBERT-TR}.

pdf bib abs

Dealing with Data Scarcity in Spoken Question Answering
Merve Ünlü Menevşe | Yusufcan Manav | Ebru Arisoy | Arzucan Özgür
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper focuses on dealing with data scarcity in spoken question answering (QA) using automatic question-answer generation and a carefully selected fine-tuning strategy that leverages limited annotated data (paragraphs and question-answer pairs). Spoken QA is a challenging task due to using spoken documents, i.e., erroneous automatic speech recognition (ASR) transcriptions, and the scarcity of spoken QA data. We propose a framework for utilizing limited annotated data effectively to improve spoken QA performance. To deal with data scarcity, we train a question-answer generation model with annotated data and then produce large amounts of question-answer pairs from unannotated data (paragraphs). Our experiments demonstrate that incorporating limited annotated data and the automatically generated data through a carefully selected fine-tuning strategy leads to 5.5% relative F1 gain over the model trained only with annotated data. Moreover, the proposed framework is also effective in high ASR errors.

pdf bib abs

Overview of the Hate Speech Detection in Turkish and Arabic Tweets (HSD-2Lang) Shared Task at CASE 2024
Gökçe Uludoğan | Somaiyeh Dehghan | Inanc Arin | Elif Erol | Berrin Yanikoglu | Arzucan Özgür
Proceedings of the 7th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2024)

This paper offers an overview of Hate Speech Detection in Turkish and Arabic Tweets (HSD-2Lang) Shared Task at CASE workshop to be held jointly with EACL 2024. The task was divided into two subtasks: Subtask A, targeting hate speech detection in various Turkish contexts, and Subtask B, addressing hate speech detection in Arabic with limited data. The shared task attracted significant attention with 33 teams that registered and 10 teams that participated in at least one task. In this paper, we provide the details of the tasks and the approaches adopted by the participant along with an analysis of the results obtained from this shared task.

2023

pdf bib abs

TULAP - An Accessible and Sustainable Platform for Turkish Natural Language Processing Resources
Susan Uskudarli | Muhammet Şen | Furkan Akkurt | Merve Gürbüz | Onur Gungor | Arzucan Özgür | Tunga Güngör
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

Access to natural language processing resources is essential for their continuous improvement. This can be especially challenging in educational institutions where the software development effort required to package and release research outcomes may be overwhelming and under-recognized. Access towell-prepared and reliable research outcomes is important both for their developers as well as the greater research community. This paper presents an approach to address this concern with two main goals: (1) to create an open-source easily deployable platform where resources can be easily shared and explored, and (2) to use this platform to publish open-source Turkish NLP resources (datasets and tools) created by a research lab. The Turkish Natural Language Processing (TULAP) was designed and developed as an easy-to-use platform to share dataset and tool resources which supports interactive tool demos. Numerous open access Turkish NLP resources have been shared on TULAP. All tools are containerized to support portability for custom use. This paper describes the design, implementation, and deployment of TULAP with use cases (available at https://tulap.cmpe.boun.edu.tr/). A short video demonstrating our system is available at https://figshare.com/articles/media/TULAP_Demo/22179047.

pdf bib abs

A Dataset for Investigating the Impact of Context for Offensive Language Detection in Tweets
Musa İhtiyar | Ömer Özdemir | Mustafa Erengül | Arzucan Özgür
Findings of the Association for Computational Linguistics: EMNLP 2023

Offensive language detection is crucial in natural language processing (NLP). We investigated the importance of context for detecting such language in reply tweets on Twitter, where the use of offensive language is widespread. We collected a Turkish tweet dataset where the target group was unvaccinated people during the Covid period. Tweets in the dataset were enriched with contextual information by adding the original tweet to which a particular tweet was posted as a reply. The dataset, which includes over 28,000 tweet-reply pairs, was manually labeled by human annotators and made publicly available. In addition, we compared the performance of different machine learning models with and without contextual information. Our results show that this type of contextual information was not very useful in improving the performance of the models in general, although it slightly increased the macro-averaged F1-score of certain models.

2022

pdf bib abs

A Dataset and BERT-based Models for Targeted Sentiment Analysis on Turkish Texts
Mustafa Melih Mutlu | Arzucan Özgür
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Targeted Sentiment Analysis aims to extract sentiment towards a particular target from a given text. It is a field that is attracting attention due to the increasing accessibility of the Internet, which leads people to generate an enormous amount of data. Sentiment analysis, which in general requires annotated data for training, is a well-researched area for widely studied languages such as English. For low-resource languages such as Turkish, there is a lack of such annotated data. We present an annotated Turkish dataset suitable for targeted sentiment analysis. We also propose BERT-based models with different architectures to accomplish the task of targeted sentiment analysis. The results demonstrate that the proposed models outperform the traditional sentiment analysis models for the targeted sentiment analysis task.

pdf bib abs

A Framework for Automatic Generation of Spoken Question-Answering Data
Merve Ünlü Menevşe | Yusufcan Manav | Ebru Arisoy | Arzucan Özgür
Findings of the Association for Computational Linguistics: EMNLP 2022

This paper describes a framework to automatically generate a spoken question answering (QA) dataset. The framework consists of a question generation (QG) module to generate questions automatically from given text documents, a text-to-speech (TTS) module to convert the text documents into spoken form and an automatic speech recognition (ASR) module to transcribe the spoken content. The final dataset contains question-answer pairs for both the reference text and ASR transcriptions as well as the audio files corresponding to each reference text. For QG and ASR systems we used pre-trained multilingual encoder-decoder transformer models and fine-tuned these models using a limited amount of manually generated QA data and TTS-based speech data, respectively. As a proof of concept, we investigated the proposed framework for Turkish and generated the Turkish Question Answering (TurQuAse) dataset using Wikipedia articles. Manual evaluation of the automatically generated question- answer pairs and QA performance evaluation with state of-the-art models on TurQuAse show that the proposed framework is efficient for automatically generating spoken QA datasets. To the best of our knowledge, TurQuAse is the first publicly available spoken question answering dataset for Turkish. The proposed framework can be easily extended to other languages where a limited amount of QA data is available.

pdf bib abs

Identifying Hate Speech Using Neural Networks and Discourse Analysis Techniques
Zehra Melce Hüsünbeyi | Didar Akar | Arzucan Özgür
Proceedings of the First Workshop on Language Technology and Resources for a Fair, Inclusive, and Safe Society within the 13th Language Resources and Evaluation Conference

Discriminatory language, in particular hate speech, is a global problem posing a grave threat to democracy and human rights. Yet, it is not always easy to identify, as it is rarely explicit. In order to detect hate speech, we developed Hierarchical Attention Network (HAN) based and Bidirectional Encoder Representations from Transformer (BERT) based deep learning models to capture the changing discursive cues and understand the context around the discourse. In addition, we designed linguistic features using critical discourse analysis techniques and integrated them into these neural network models. We studied the compatibility of our model with the hate speech detection problem by comparing it with traditional machine learning models, as well as a Convolution Neural Network (CNN) based model, a Convolutional Neural Network-Gated Recurrent Unit (CNN-GRU) based model which reached significant performance results for hate speech detection. Our results on a manually annotated corpus of print media in Turkish show that the proposed approach is effective for hate speech detection. We believe that the feature sets created for the Turkish language will encourage new studies in the quantitative analysis of hate speech.

pdf bib abs

Improving Code-Switching Dependency Parsing with Semi-Supervised Auxiliary Tasks
Şaziye Betül Özateş | Arzucan Özgür | Tunga Gungor | Özlem Çetinoğlu
Findings of the Association for Computational Linguistics: NAACL 2022

Code-switching dependency parsing stands as a challenging task due to both the scarcity of necessary resources and the structural difficulties embedded in code-switched languages. In this study, we introduce novel sequence labeling models to be used as auxiliary tasks for dependency parsing of code-switched text in a semi-supervised scheme. We show that using auxiliary tasks enhances the performance of an LSTM-based dependency parsing model and leads to better results compared to an XLM-R-based model with significantly less computational and time complexity. As the first study that focuses on multiple code-switching language pairs for dependency parsing, we acquire state-of-the-art scores on all of the studied languages. Our best models outperform the previous work by 7.4 LAS points on average.

2021

pdf bib abs

BOUN at SemEval-2021 Task 9: Text Augmentation Techniques for Fact Verification in Tabular Data
Abdullatif Köksal | Yusuf Yüksel | Bekir Yıldırım | Arzucan Özgür
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

In this paper, we present our text augmentation based approach for the Table Statement Support Subtask (Phase A) of SemEval-2021 Task 9. We experiment with different text augmentation techniques such as back translation and synonym swapping using Word2Vec and WordNet. We show that text augmentation techniques lead to 2.5% improvement in F1 on the test set. Further, we investigate the impact of domain adaptation and joint learning on fact verification in tabular data by utilizing the SemTabFacts and TabFact datasets. We observe that joint learning improves the F1 scores on the SemTabFacts and TabFact test sets by 3.31% and 0.77%, respectively.

pdf bib abs

Balancing Methods for Multi-label Text Classification with Long-Tailed Class Distribution
Yi Huang | Buse Giledereli | Abdullatif Köksal | Arzucan Özgür | Elif Ozkirimli
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Multi-label text classification is a challenging task because it requires capturing label dependencies. It becomes even more challenging when class distribution is long-tailed. Resampling and re-weighting are common approaches used for addressing the class imbalance problem, however, they are not effective when there is label dependency besides class imbalance because they result in oversampling of common labels. Here, we introduce the application of balancing loss functions for multi-label text classification. We perform experiments on a general domain dataset with 90 labels (Reuters-21578) and a domain-specific dataset from PubMed with 18211 labels. We find that a distribution-balanced loss function, which inherently addresses both the class imbalance and label linkage problems, outperforms commonly used loss functions. Distribution balancing methods have been successfully used in the image recognition field. Here, we show their effectiveness in natural language processing. Source code is available at https://github.com/blessu/BalancedLossNLP.

pdf bib abs

Overcoming the challenges in morphological annotation of Turkish in universal dependencies framework
Talha Bedir | Karahan Şahin | Onur Gungor | Suzan Uskudarli | Arzucan Özgür | Tunga Güngör | Balkiz Ozturk Basaran
Proceedings of the Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop

This paper presents several challenges faced when annotating Turkish treebanks in accordance with the Universal Dependencies (UD) guidelines and proposes solutions to address them. Most of these challenges stem from the lack of adequate support in the UD framework to accurately represent null morphemes and complex derivations, which results in a significant loss of information for Turkish. This loss negatively impacts the tools that are developed based on these treebanks. We raised and discussed these issues within the community on the official UD portal. This paper presents these issues and our proposals to more accurately represent morphosyntactic information for Turkish while adhering to guidelines of UD. This work aims to contribute to the representation of Turkish and other agglutinative languages in UD-based treebanks, which in turn aids to develop more accurately annotated datasets for such languages.

2020

pdf bib abs

Analyzing ELMo and DistilBERT on Socio-political News Classification
Berfu Büyüköz | Ali Hürriyetoğlu | Arzucan Özgür
Proceedings of the Workshop on Automated Extraction of Socio-political Events from News 2020

This study evaluates the robustness of two state-of-the-art deep contextual language representations, ELMo and DistilBERT, on supervised learning of binary protest news classification (PC) and sentiment analysis (SA) of product reviews. A ”cross-context” setting is enabled using test sets that are distinct from the training data. The models are fine-tuned and fed into a Feed-Forward Neural Network (FFNN) and a Bidirectional Long Short Term Memory network (BiLSTM). Multinomial Naive Bayes (MNB) and Linear Support Vector Machine (LSVM) are used as traditional baselines. The results suggest that DistilBERT can transfer generic semantic knowledge to other domains better than ELMo. DistilBERT is also 30% smaller and 83% faster than ELMo, which suggests superiority for smaller computational training budgets. When generalization is not the utmost preference and test domain is similar to the training domain, the traditional machine learning (ML) algorithms can still be considered as more economic alternatives to deep language representations.

pdf bib abs

The RELX Dataset and Matching the Multilingual Blanks for Cross-Lingual Relation Classification
Abdullatif Köksal | Arzucan Özgür
Findings of the Association for Computational Linguistics: EMNLP 2020

Relation classification is one of the key topics in information extraction, which can be used to construct knowledge bases or to provide useful information for question answering. Current approaches for relation classification are mainly focused on the English language and require lots of training data with human annotations. Creating and annotating a large amount of training data for low-resource languages is impractical and expensive. To overcome this issue, we propose two cross-lingual relation classification models: a baseline model based on Multilingual BERT and a new multilingual pretraining setup, which significantly improves the baseline with distant supervision. For evaluation, we introduce a new public benchmark dataset for cross-lingual relation classification in English, French, German, Spanish, and Turkish, called RELX. We also provide the RELX-Distant dataset, which includes hundreds of thousands of sentences with relations from Wikipedia and Wikidata collected by distant supervision for these languages. Our code and data are available at: https://github.com/boun-tabi/RELX

pdf bib abs

Vapur: A Search Engine to Find Related Protein - Compound Pairs in COVID-19 Literature
Abdullatif Köksal | Hilal Dönmez | Rıza Özçelik | Elif Ozkirimli | Arzucan Özgür
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

Coronavirus Disease of 2019 (COVID-19) created dire consequences globally and triggered an intense scientific effort from different domains. The resulting publications created a huge text collection in which finding the studies related to a biomolecule of interest is challenging for general purpose search engines because the publications are rich in domain specific terminology. Here, we present Vapur: an online COVID-19 search engine specifically designed to find related protein - chemical pairs. Vapur is empowered with a relation-oriented inverted index that is able to retrieve and group studies for a query biomolecule with respect to its related entities. The inverted index of Vapur is automatically created with a BioNLP pipeline and integrated with an online user interface. The online interface is designed for the smooth traversal of the current literature by domain researchers and is publicly available at https://tabilab.cmpe.boun.edu.tr/vapur/.

2019

pdf bib abs

In this paper, we present the current version of two different treebanks, the re-annotation of the Turkish PUD Treebank and the first annotation of the Turkish National Corpus Universal Dependency (henceforth TNC-UD). The annotation of both treebanks, the Turkish PUD Treebank and TNC-UD, was carried out based on the decisions concerning linguistic adequacy of re-annotation of the Turkish IMST-UD Treebank (Türk et. al., forthcoming). Both of the treebanks were annotated with the same annotation process and morphological and syntactic analyses. The TNC-UD is planned to have 10,000 sentences. In this paper, we will present the first 500 sentences along with the annotation PUD Treebank. Moreover, this paper also offers the parsing results of a graph-based neural parser on the previous and re-annotated PUD, as well as the TNC-UD. In light of the comparisons, even though we observe a slight decrease in the attachment scores of the Turkish PUD treebank, we demonstrate that the annotation of the TNC-UD improves the parsing accuracy of Turkish. In addition to the treebanks, we have also constructed a custom annotation software with advanced filtering and morphological editing options. Both the treebanks, including a full edit-history and the annotation guidelines, and the custom software are publicly available under an open license online.

pdf bib

pdf bib abs

Turkish Tweet Classification with Transformer Encoder
Atıf Emre Yüksel | Yaşar Alim Türkmen | Arzucan Özgür | Berna Altınel
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Short-text classification is a challenging task, due to the sparsity and high dimensionality of the feature space. In this study, we aim to analyze and classify Turkish tweets based on their topics. Social media jargon and the agglutinative structure of the Turkish language makes this classification task even harder. As far as we know, this is the first study that uses a Transformer Encoder for short text classification in Turkish. The model is trained in a weakly supervised manner, where the training data set has been labeled automatically. Our results on the test set, which has been manually labeled, show that performing morphological analysis improves the classification performance of the traditional machine learning algorithms Random Forest, Naive Bayes, and Support Vector Machines. Still, the proposed approach achieves an F-score of 89.3 % outperforming those algorithms by at least 5 points.

pdf bib abs

BOUN-ISIK Participation: An Unsupervised Approach for the Named Entity Normalization and Relation Extraction of Bacteria Biotopes
İlknur Karadeniz | Ömer Faruk Tuna | Arzucan Özgür
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks

This paper presents our participation to the Bacteria Biotope Task of the BioNLP Shared Task 2019. Our participation includes two systems for the two subtasks of the Bacteria Biotope Task: the normalization of entities (BB-norm) and the identification of the relations between the entities given a biomedical text (BB-rel). For the normalization of entities, we utilized word embeddings and syntactic re-ranking. For the relation extraction task, pre-defined rules are used. Although both approaches are unsupervised, in the sense that they do not need any labeled data, they achieved promising results. Especially, for the BB-norm task, the results have shown that the proposed method performs as good as deep learning based methods, which require labeled data.

2018

pdf bib abs

A Morphology-Based Representation Model for LSTM-Based Dependency Parsing of Agglutinative Languages
Şaziye Betül Özateş | Arzucan Özgür | Tunga Güngör | Balkız Öztürk
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We propose two word representation models for agglutinative languages that better capture the similarities between words which have similar tasks in sentences. Our models highlight the morphological features in words and embed morphological information into their dense representations. We have tested our models on an LSTM-based dependency parser with character-based word embeddings proposed by Ballesteros et al. (2015). We participated in the CoNLL 2018 Shared Task on multilingual parsing from raw text to universal dependencies as the BOUN team. We show that our morphology-based embedding models improve the parsing performance for most of the agglutinative languages.

2017

pdf bib abs

BUSEM at SemEval-2017 Task 4A Sentiment Analysis with Word Embedding and Long Short Term Memory RNN Approaches
Deger Ayata | Murat Saraclar | Arzucan Ozgur
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes our approach for SemEval-2017 Task 4: Sentiment Analysis in Twitter. We have participated in Subtask A: Message Polarity Classification subtask and developed two systems. The first system uses word embeddings for feature representation and Support Vector Machine, Random Forest and Naive Bayes algorithms for classification of Twitter messages into negative, neutral and positive polarity. The second system is based on Long Short Term Memory Recurrent Neural Networks and uses word indexes as sequence of inputs for feature representation.

2016

pdf bib

Ontology-Based Categorization of Bacteria and Habitat Entities using Information Retrieval Techniques
Mert Tiftikci | Hakan Şahin | Berfu Büyüköz | Alper Yayıkçı | Arzucan Özgür
Proceedings of the 4th BioNLP Shared Task Workshop

pdf bib abs

Sentence Similarity based on Dependency Tree Kernels for Multi-document Summarization
Şaziye Betül Özateş | Arzucan Özgür | Dragomir Radev
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We introduce an approach based on using the dependency grammar representations of sentences to compute sentence similarity for extractive multi-document summarization. We adapt and investigate the effects of two untyped dependency tree kernels, which have originally been proposed for relation extraction, to the multi-document summarization problem. In addition, we propose a series of novel dependency grammar based kernels to better represent the syntactic and semantic similarities among the sentences. The proposed methods incorporate the type information of the dependency relations for sentence similarity calculation. To our knowledge, this is the first study that investigates using dependency tree based sentence similarity for multi-document summarization.

pdf bib abs

Named Entity Recognition on Twitter for Turkish using Semi-supervised Learning with Word Embeddings
Eda Okur | Hakan Demir | Arzucan Özgür
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Recently, due to the increasing popularity of social media, the necessity for extracting information from informal text types, such as microblog texts, has gained significant attention. In this study, we focused on the Named Entity Recognition (NER) problem on informal text types for Turkish. We utilized a semi-supervised learning approach based on neural networks. We applied a fast unsupervised method for learning continuous representations of words in vector space. We made use of these obtained word embeddings, together with language independent features that are engineered to work better on informal text types, for generating a Turkish NER system on microblog texts. We evaluated our Turkish NER system on Twitter messages and achieved better F-score performances than the published results of previously proposed NER systems on Turkish tweets. Since we did not employ any language dependent features, we believe that our method can be easily adapted to microblog texts in other morphologically rich languages.

pdf bib

Towards Building a Political Protest Database to Explain Changes in the Welfare State
Çağıl Sönmez | Arzucan Özgür | Erdem Yörük
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

pdf bib abs

Segmenting Hashtags using Automatically Created Training Data
Arda Çelebi | Arzucan Özgür
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Hashtags, which are commonly composed of multiple words, are increasingly used to convey the actual messages in tweets. Understanding what tweets are saying is getting more dependent on understanding hashtags. Therefore, identifying the individual words that constitute a hashtag is an important, yet a challenging task due to the abrupt nature of the language used in tweets. In this study, we introduce a feature-rich approach based on using supervised machine learning methods to segment hashtags. Our approach is unsupervised in the sense that instead of using manually segmented hashtags for training the machine learning classifiers, we automatically create our training data by using tweets as well as by automatically extracting hashtag segmentations from a large corpus. We achieve promising results with such automatically created noisy training data.

2014

pdf bib abs

Self-training a Constituency Parser using n-gram Trees
Arda Çelebi | Arzucan Özgür
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this study, we tackle the problem of self-training a feature-rich discriminative constituency parser. We approach the self-training problem with the assumption that while the full sentence parse tree produced by a parser may contain errors, some portions of it are more likely to be correct. We hypothesize that instead of feeding the parser the guessed full sentence parse trees of its own, we can break them down into smaller ones, namely n-gram trees, and perform self-training on them. We build an n-gram parser and transfer the distinct expertise of the n-gram parser to the full sentence parser by using the Hierarchical Joint Learning (HJL) approach. The resulting jointly self-trained parser obtains slight improvement over the baseline.

pdf bib abs

Expanding machine translation training data with an out-of-domain corpus using language modeling based vocabulary saturation
Burak Aydın | Arzucan Özgür
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

The training data size is of utmost importance for statistical machine translation (SMT), since it affects the training time, model size, decoding speed, as well as the system’s overall success. One of the challenges for developing SMT systems for languages with less resources is the limited sizes of the available training data. In this paper, we propose an approach for expanding the training data by including parallel texts from an out-of-domain corpus. Selecting the best out-of-domain sentences for inclusion in the training set is important for the overall performance of the system. Our method is based on first ranking the out-of-domain sentences using a language modeling approach, and then, including the sentences to the training set by using the vocabulary saturation filter technique. We evaluated our approach for the English-Turkish language pair and obtained promising results. Performance improvements of up to +0.8 BLEU points for the English-Turkish translation system are achieved. We compared our results with the translation model combination approaches as well and reported the improvements. Moreover, we implemented our system with dependency parse tree based language modeling in addition to the n-gram based language modeling and reported comparable results.

pdf bib

A Graph-based Approach for Contextual Text Normalization
Cagil Sönmez | Arzucan Özgür
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib

Analyzing Stemming Approaches for Turkish Multi-Document Summarization
Muhammed Yavuz Nuzumlalı | Arzucan Özgür
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)