Arzucan Özgür

Also published as: Arzucan Ozgur


2021

pdf bib
Balancing Methods for Multi-label Text Classification with Long-Tailed Class Distribution
Yi Huang | Buse Giledereli | Abdullatif Köksal | Arzucan Özgür | Elif Ozkirimli
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Multi-label text classification is a challenging task because it requires capturing label dependencies. It becomes even more challenging when class distribution is long-tailed. Resampling and re-weighting are common approaches used for addressing the class imbalance problem, however, they are not effective when there is label dependency besides class imbalance because they result in oversampling of common labels. Here, we introduce the application of balancing loss functions for multi-label text classification. We perform experiments on a general domain dataset with 90 labels (Reuters-21578) and a domain-specific dataset from PubMed with 18211 labels. We find that a distribution-balanced loss function, which inherently addresses both the class imbalance and label linkage problems, outperforms commonly used loss functions. Distribution balancing methods have been successfully used in the image recognition field. Here, we show their effectiveness in natural language processing. Source code is available at https://github.com/blessu/BalancedLossNLP.

pdf bib
Overcoming the challenges in morphological annotation of Turkish in universal dependencies framework
Talha Bedir | Karahan Şahin | Onur Gungor | Suzan Uskudarli | Arzucan Özgür | Tunga Güngör | Balkiz Ozturk Basaran
Proceedings of The Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop

This paper presents several challenges faced when annotating Turkish treebanks in accordance with the Universal Dependencies (UD) guidelines and proposes solutions to address them. Most of these challenges stem from the lack of adequate support in the UD framework to accurately represent null morphemes and complex derivations, which results in a significant loss of information for Turkish. This loss negatively impacts the tools that are developed based on these treebanks. We raised and discussed these issues within the community on the official UD portal. This paper presents these issues and our proposals to more accurately represent morphosyntactic information for Turkish while adhering to guidelines of UD. This work aims to contribute to the representation of Turkish and other agglutinative languages in UD-based treebanks, which in turn aids to develop more accurately annotated datasets for such languages.

pdf bib
BOUN at SemEval-2021 Task 9: Text Augmentation Techniques for Fact Verification in Tabular Data
Abdullatif Köksal | Yusuf Yüksel | Bekir Yıldırım | Arzucan Özgür
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

In this paper, we present our text augmentation based approach for the Table Statement Support Subtask (Phase A) of SemEval-2021 Task 9. We experiment with different text augmentation techniques such as back translation and synonym swapping using Word2Vec and WordNet. We show that text augmentation techniques lead to 2.5% improvement in F1 on the test set. Further, we investigate the impact of domain adaptation and joint learning on fact verification in tabular data by utilizing the SemTabFacts and TabFact datasets. We observe that joint learning improves the F1 scores on the SemTabFacts and TabFact test sets by 3.31% and 0.77%, respectively.

2020

pdf bib
Vapur: A Search Engine to Find Related Protein - Compound Pairs in COVID-19 Literature
Abdullatif Köksal | Hilal Dönmez | Rıza Özçelik | Elif Ozkirimli | Arzucan Özgür
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

Coronavirus Disease of 2019 (COVID-19) created dire consequences globally and triggered an intense scientific effort from different domains. The resulting publications created a huge text collection in which finding the studies related to a biomolecule of interest is challenging for general purpose search engines because the publications are rich in domain specific terminology. Here, we present Vapur: an online COVID-19 search engine specifically designed to find related protein - chemical pairs. Vapur is empowered with a relation-oriented inverted index that is able to retrieve and group studies for a query biomolecule with respect to its related entities. The inverted index of Vapur is automatically created with a BioNLP pipeline and integrated with an online user interface. The online interface is designed for the smooth traversal of the current literature by domain researchers and is publicly available at https://tabilab.cmpe.boun.edu.tr/vapur/.

pdf bib
The RELX Dataset and Matching the Multilingual Blanks for Cross-Lingual Relation Classification
Abdullatif Köksal | Arzucan Özgür
Findings of the Association for Computational Linguistics: EMNLP 2020

Relation classification is one of the key topics in information extraction, which can be used to construct knowledge bases or to provide useful information for question answering. Current approaches for relation classification are mainly focused on the English language and require lots of training data with human annotations. Creating and annotating a large amount of training data for low-resource languages is impractical and expensive. To overcome this issue, we propose two cross-lingual relation classification models: a baseline model based on Multilingual BERT and a new multilingual pretraining setup, which significantly improves the baseline with distant supervision. For evaluation, we introduce a new public benchmark dataset for cross-lingual relation classification in English, French, German, Spanish, and Turkish, called RELX. We also provide the RELX-Distant dataset, which includes hundreds of thousands of sentences with relations from Wikipedia and Wikidata collected by distant supervision for these languages. Our code and data are available at: https://github.com/boun-tabi/RELX

pdf bib
Analyzing ELMo and DistilBERT on Socio-political News Classification
Berfu Büyüköz | Ali Hürriyetoğlu | Arzucan Özgür
Proceedings of the Workshop on Automated Extraction of Socio-political Events from News 2020

This study evaluates the robustness of two state-of-the-art deep contextual language representations, ELMo and DistilBERT, on supervised learning of binary protest news classification (PC) and sentiment analysis (SA) of product reviews. A ”cross-context” setting is enabled using test sets that are distinct from the training data. The models are fine-tuned and fed into a Feed-Forward Neural Network (FFNN) and a Bidirectional Long Short Term Memory network (BiLSTM). Multinomial Naive Bayes (MNB) and Linear Support Vector Machine (LSVM) are used as traditional baselines. The results suggest that DistilBERT can transfer generic semantic knowledge to other domains better than ELMo. DistilBERT is also 30% smaller and 83% faster than ELMo, which suggests superiority for smaller computational training budgets. When generalization is not the utmost preference and test domain is similar to the training domain, the traditional machine learning (ML) algorithms can still be considered as more economic alternatives to deep language representations.

2019

pdf bib
BOUN-ISIK Participation: An Unsupervised Approach for the Named Entity Normalization and Relation Extraction of Bacteria Biotopes
İlknur Karadeniz | Ömer Faruk Tuna | Arzucan Özgür
Proceedings of The 5th Workshop on BioNLP Open Shared Tasks

This paper presents our participation to the Bacteria Biotope Task of the BioNLP Shared Task 2019. Our participation includes two systems for the two subtasks of the Bacteria Biotope Task: the normalization of entities (BB-norm) and the identification of the relations between the entities given a biomedical text (BB-rel). For the normalization of entities, we utilized word embeddings and syntactic re-ranking. For the relation extraction task, pre-defined rules are used. Although both approaches are unsupervised, in the sense that they do not need any labeled data, they achieved promising results. Especially, for the BB-norm task, the results have shown that the proposed method performs as good as deep learning based methods, which require labeled data.

pdf bib
Turkish Tweet Classification with Transformer Encoder
Atıf Emre Yüksel | Yaşar Alim Türkmen | Arzucan Özgür | Berna Altınel
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Short-text classification is a challenging task, due to the sparsity and high dimensionality of the feature space. In this study, we aim to analyze and classify Turkish tweets based on their topics. Social media jargon and the agglutinative structure of the Turkish language makes this classification task even harder. As far as we know, this is the first study that uses a Transformer Encoder for short text classification in Turkish. The model is trained in a weakly supervised manner, where the training data set has been labeled automatically. Our results on the test set, which has been manually labeled, show that performing morphological analysis improves the classification performance of the traditional machine learning algorithms Random Forest, Naive Bayes, and Support Vector Machines. Still, the proposed approach achieves an F-score of 89.3 % outperforming those algorithms by at least 5 points.

pdf bib
Turkish Treebanking: Unifying and Constructing Efforts
Utku Türk | Furkan Atmaca | Şaziye Betül Özateş | Abdullatif Köksal | Balkiz Ozturk Basaran | Tunga Gungor | Arzucan Özgür
Proceedings of the 13th Linguistic Annotation Workshop

In this paper, we present the current version of two different treebanks, the re-annotation of the Turkish PUD Treebank and the first annotation of the Turkish National Corpus Universal Dependency (henceforth TNC-UD). The annotation of both treebanks, the Turkish PUD Treebank and TNC-UD, was carried out based on the decisions concerning linguistic adequacy of re-annotation of the Turkish IMST-UD Treebank (Türk et. al., forthcoming). Both of the treebanks were annotated with the same annotation process and morphological and syntactic analyses. The TNC-UD is planned to have 10,000 sentences. In this paper, we will present the first 500 sentences along with the annotation PUD Treebank. Moreover, this paper also offers the parsing results of a graph-based neural parser on the previous and re-annotated PUD, as well as the TNC-UD. In light of the comparisons, even though we observe a slight decrease in the attachment scores of the Turkish PUD treebank, we demonstrate that the annotation of the TNC-UD improves the parsing accuracy of Turkish. In addition to the treebanks, we have also constructed a custom annotation software with advanced filtering and morphological editing options. Both the treebanks, including a full edit-history and the annotation guidelines, and the custom software are publicly available under an open license online.

pdf bib
Improving the Annotations in the Turkish Universal Dependency Treebank
Utku Türk | Furkan Atmaca | Şaziye Betül Özateş | Balkız Öztürk Başaran | Tunga Güngör | Arzucan Özgür
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)

2018

pdf bib
A Morphology-Based Representation Model for LSTM-Based Dependency Parsing of Agglutinative Languages
Şaziye Betül Özateş | Arzucan Özgür | Tunga Güngör | Balkız Öztürk
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We propose two word representation models for agglutinative languages that better capture the similarities between words which have similar tasks in sentences. Our models highlight the morphological features in words and embed morphological information into their dense representations. We have tested our models on an LSTM-based dependency parser with character-based word embeddings proposed by Ballesteros et al. (2015). We participated in the CoNLL 2018 Shared Task on multilingual parsing from raw text to universal dependencies as the BOUN team. We show that our morphology-based embedding models improve the parsing performance for most of the agglutinative languages.

2017

pdf bib
BUSEM at SemEval-2017 Task 4A Sentiment Analysis with Word Embedding and Long Short Term Memory RNN Approaches
Deger Ayata | Murat Saraclar | Arzucan Ozgur
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes our approach for SemEval-2017 Task 4: Sentiment Analysis in Twitter. We have participated in Subtask A: Message Polarity Classification subtask and developed two systems. The first system uses word embeddings for feature representation and Support Vector Machine, Random Forest and Naive Bayes algorithms for classification of Twitter messages into negative, neutral and positive polarity. The second system is based on Long Short Term Memory Recurrent Neural Networks and uses word indexes as sequence of inputs for feature representation.

2016

pdf bib
Towards Building a Political Protest Database to Explain Changes in the Welfare State
Çağıl Sönmez | Arzucan Özgür | Erdem Yörük
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

pdf bib
Ontology-Based Categorization of Bacteria and Habitat Entities using Information Retrieval Techniques
Mert Tiftikci | Hakan Şahin | Berfu Büyüköz | Alper Yayıkçı | Arzucan Özgür
Proceedings of the 4th BioNLP Shared Task Workshop

pdf bib
Named Entity Recognition on Twitter for Turkish using Semi-supervised Learning with Word Embeddings
Eda Okur | Hakan Demir | Arzucan Özgür
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Recently, due to the increasing popularity of social media, the necessity for extracting information from informal text types, such as microblog texts, has gained significant attention. In this study, we focused on the Named Entity Recognition (NER) problem on informal text types for Turkish. We utilized a semi-supervised learning approach based on neural networks. We applied a fast unsupervised method for learning continuous representations of words in vector space. We made use of these obtained word embeddings, together with language independent features that are engineered to work better on informal text types, for generating a Turkish NER system on microblog texts. We evaluated our Turkish NER system on Twitter messages and achieved better F-score performances than the published results of previously proposed NER systems on Turkish tweets. Since we did not employ any language dependent features, we believe that our method can be easily adapted to microblog texts in other morphologically rich languages.

pdf bib
Sentence Similarity based on Dependency Tree Kernels for Multi-document Summarization
Şaziye Betül Özateş | Arzucan Özgür | Dragomir Radev
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We introduce an approach based on using the dependency grammar representations of sentences to compute sentence similarity for extractive multi-document summarization. We adapt and investigate the effects of two untyped dependency tree kernels, which have originally been proposed for relation extraction, to the multi-document summarization problem. In addition, we propose a series of novel dependency grammar based kernels to better represent the syntactic and semantic similarities among the sentences. The proposed methods incorporate the type information of the dependency relations for sentence similarity calculation. To our knowledge, this is the first study that investigates using dependency tree based sentence similarity for multi-document summarization.

pdf bib
Segmenting Hashtags using Automatically Created Training Data
Arda Çelebi | Arzucan Özgür
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Hashtags, which are commonly composed of multiple words, are increasingly used to convey the actual messages in tweets. Understanding what tweets are saying is getting more dependent on understanding hashtags. Therefore, identifying the individual words that constitute a hashtag is an important, yet a challenging task due to the abrupt nature of the language used in tweets. In this study, we introduce a feature-rich approach based on using supervised machine learning methods to segment hashtags. Our approach is unsupervised in the sense that instead of using manually segmented hashtags for training the machine learning classifiers, we automatically create our training data by using tweets as well as by automatically extracting hashtag segmentations from a large corpus. We achieve promising results with such automatically created noisy training data.

2014

pdf bib
A Graph-based Approach for Contextual Text Normalization
Cagil Sönmez | Arzucan Özgür
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
Analyzing Stemming Approaches for Turkish Multi-Document Summarization
Muhammed Yavuz Nuzumlalı | Arzucan Özgür
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
Expanding machine translation training data with an out-of-domain corpus using language modeling based vocabulary saturation
Burak Aydın | Arzucan Özgür
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

The training data size is of utmost importance for statistical machine translation (SMT), since it affects the training time, model size, decoding speed, as well as the system’s overall success. One of the challenges for developing SMT systems for languages with less resources is the limited sizes of the available training data. In this paper, we propose an approach for expanding the training data by including parallel texts from an out-of-domain corpus. Selecting the best out-of-domain sentences for inclusion in the training set is important for the overall performance of the system. Our method is based on first ranking the out-of-domain sentences using a language modeling approach, and then, including the sentences to the training set by using the vocabulary saturation filter technique. We evaluated our approach for the English-Turkish language pair and obtained promising results. Performance improvements of up to +0.8 BLEU points for the English-Turkish translation system are achieved. We compared our results with the translation model combination approaches as well and reported the improvements. Moreover, we implemented our system with dependency parse tree based language modeling in addition to the n-gram based language modeling and reported comparable results.

pdf bib
Self-training a Constituency Parser using n-gram Trees
Arda Çelebi | Arzucan Özgür
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this study, we tackle the problem of self-training a feature-rich discriminative constituency parser. We approach the self-training problem with the assumption that while the full sentence parse tree produced by a parser may contain errors, some portions of it are more likely to be correct. We hypothesize that instead of feeding the parser the guessed full sentence parse trees of its own, we can break them down into smaller ones, namely n-gram trees, and perform self-training on them. We build an n-gram parser and transfer the distinct expertise of the $n$-gram parser to the full sentence parser by using the Hierarchical Joint Learning (HJL) approach. The resulting jointly self-trained parser obtains slight improvement over the baseline.

2013

pdf bib
BOUNCE: Sentiment Classification in Twitter using Rich Feature Sets
Nadin Kökciyan | Arda Çelebi | Arzucan Özgür | Suzan Üsküdarlı
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

pdf bib
Bacteria Biotope Detection, Ontology-based Normalization, and Relation Extraction using Syntactic Rules
İlknur Karadeniz | Arzucan Özgür
Proceedings of the BioNLP Shared Task 2013 Workshop

2010

pdf bib
Citation Summarization Through Keyphrase Extraction
Vahed Qazvinian | Dragomir R. Radev | Arzucan Özgür
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

2009

pdf bib
Detecting Speculations and their Scopes in Scientific Text
Arzucan Özgür | Dragomir R. Radev
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf bib
Supervised Classification for Extracting Biomedical Events
Arzucan Özgür | Dragomir Radev
Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task

2007

pdf bib
Semi-Supervised Classification for Extracting Protein Interaction Sentences using Dependency Parsing
Güneş Erkan | Arzucan Özgür | Dragomir R. Radev
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)