Mona Diab

Also published as: Mona T. Diab


2021

pdf bib
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching
Thamar Solorio | Shuguang Chen | Alan W. Black | Mona Diab | Sunayana Sitaram | Victor Soto | Emre Yilmaz | Anirudh Srinivasan
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

pdf bib
Detecting Hallucinated Content in Conditional Neural Sequence Generation
Chunting Zhou | Graham Neubig | Jiatao Gu | Mona Diab | Francisco Guzmán | Luke Zettlemoyer | Marjan Ghazvininejad
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
Active Learning for Rumor Identification on Social Media
Parsa Farinneya | Mohammad Mahdi Abdollah Pour | Sardar Hamidian | Mona Diab
Findings of the Association for Computational Linguistics: EMNLP 2021

Social media has emerged as a key channel for seeking information. Online users spend several hours reading, posting, and searching for news on microblogging platforms daily. However, this could act as a double-edged sword especially when not all information online is reliable. Moreover, the inherently unmoderated nature of social media renders identifying unverified information ever more challenging. Most of the existing approaches for rumor tracking are not scalable because of their dependency on a significant amount of labeled data. In this work, we investigate this problem from different angles. We design an Active-Transfer Learning (ATL) strategy to identify rumors with a limited amount of annotated data. We go beyond that and investigate the impact of leveraging various machine learning approaches in addition to different contextual representations. We discuss the impact of multiple classifiers on a limited amount of annotated data followed by an interactive approach to gradually update the models by adding the least certain samples (LCS) from the pool of unlabeled data. Our proposed Active Learning (AL) strategy achieves faster convergence in terms of the F-score while requiring fewer annotated samples (42% of the whole dataset for the best model).

pdf bib
Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data
Wei-Jen Ko | Ahmed El-Kishky | Adithya Renduchintala | Vishrav Chaudhary | Naman Goyal | Francisco Guzmán | Pascale Fung | Philipp Koehn | Mona Diab
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages. Fortunately, some low-resource languages are linguistically related or similar to high-resource languages; these related languages may share many lexical or syntactic structures. In this work, we exploit this linguistic overlap to facilitate translating to and from a low-resource language with only monolingual data, in addition to any parallel data in the related high-resource language. Our method, NMT-Adapt, combines denoising autoencoding, back-translation and adversarial objectives to utilize monolingual data for low-resource adaptation. We experiment on 7 languages from three different language families and show that our technique significantly improves translation into low-resource language compared to other translation baselines.

pdf bib
Gender bias amplification during Speed-Quality optimization in Neural Machine Translation
Adithya Renduchintala | Denise Diaz | Kenneth Heafield | Xian Li | Mona Diab
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Is bias amplified when neural machine translation (NMT) models are optimized for speed and evaluated on generic test sets using BLEU? We investigate architectures and techniques commonly used to speed up decoding in Transformer-based models, such as greedy search, quantization, average attention networks (AANs) and shallow decoder models and show their effect on gendered noun translation. We construct a new gender bias test set, SimpleGEN, based on gendered noun phrases in which there is a single, unambiguous, correct answer. While we find minimal overall BLEU degradation as we apply speed optimizations, we observe that gendered noun translation performance degrades at a much faster rate.

pdf bib
Discrete Cosine Transform as Universal Sentence Encoder
Nada Almarwani | Mona Diab
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Modern sentence encoders are used to generate dense vector representations that capture the underlying linguistic characteristics for a sequence of words, including phrases, sentences, or paragraphs. These kinds of representations are ideal for training a classifier for an end task such as sentiment analysis, question answering and text classification. Different models have been proposed to efficiently generate general purpose sentence representations to be used in pretraining protocols. While averaging is the most commonly used efficient sentence encoder, Discrete Cosine Transform (DCT) was recently proposed as an alternative that captures the underlying syntactic characteristics of a given text without compromising practical efficiency compared to averaging. However, as with most other sentence encoders, the DCT sentence encoder was only evaluated in English. To this end, we utilize DCT encoder to generate universal sentence representation for different languages such as German, French, Spanish and Russian. The experimental results clearly show the superior effectiveness of DCT encoding in which consistent performance improvements are achieved over strong baselines on multiple standardized datasets

2020

pdf bib
Proceedings of the The 4th Workshop on Computational Approaches to Code Switching
Thamar Solorio | Monojit Choudhury | Kalika Bali | Sunayana Sitaram | Amitava Das | Mona Diab
Proceedings of the The 4th Workshop on Computational Approaches to Code Switching

pdf bib
Proceedings of the First Workshop on Natural Language Processing for Medical Conversations
Parminder Bhatia | Steven Lin | Rashmi Gangadharaiah | Byron Wallace | Izhak Shafran | Chaitanya Shivade | Nan Du | Mona Diab
Proceedings of the First Workshop on Natural Language Processing for Medical Conversations

pdf bib
FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization
Esin Durmus | He He | Mona Diab
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Neural abstractive summarization models are prone to generate content inconsistent with the source document, i.e. unfaithful. Existing automatic metrics do not capture such mistakes effectively. We tackle the problem of evaluating faithfulness of a generated summary given its source document. We first collected human annotations of faithfulness for outputs from numerous models on two datasets. We find that current models exhibit a trade-off between abstractiveness and faithfulness: outputs with less word overlap with the source document are more likely to be unfaithful. Next, we propose an automatic question answering (QA) based metric for faithfulness, FEQA, which leverages recent advances in reading comprehension. Given question-answer pairs generated from the summary, a QA model extracts answers from the document; non-matched answers indicate unfaithful information in the summary. Among metrics based on word overlap, embedding similarity, and learned language understanding models, our QA-based metric has significantly higher correlation with human faithfulness scores, especially on highly abstractive summaries.

pdf bib
A Multitask Learning Approach for Diacritic Restoration
Sawsan Alqahtani | Ajay Mishra | Mona Diab
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In many languages like Arabic, diacritics are used to specify pronunciations as well as meanings. Such diacritics are often omitted in written text, increasing the number of possible pronunciations and meanings for a word. This results in a more ambiguous text making computational processing on such text more difficult. Diacritic restoration is the task of restoring missing diacritics in the written text. Most state-of-the-art diacritic restoration models are built on character level information which helps generalize the model to unseen data, but presumably lose useful information at the word level. Thus, to compensate for this loss, we investigate the use of multi-task learning to jointly optimize diacritic restoration with related NLP problems namely word segmentation, part-of-speech tagging, and syntactic diacritization. We use Arabic as a case study since it has sufficient data resources for tasks that we consider in our joint modeling. Our joint models significantly outperform the baselines and are comparable to the state-of-the-art models that are more complex relying on morphological analyzers and/or a lot more data (e.g. dialectal data).

pdf bib
DeSePtion: Dual Sequence Prediction and Adversarial Examples for Improved Fact-Checking
Christopher Hidey | Tuhin Chakrabarty | Tariq Alhindi | Siddharth Varia | Kriste Krstovski | Mona Diab | Smaranda Muresan
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The increased focus on misinformation has spurred development of data and systems for detecting the veracity of a claim as well as retrieving authoritative evidence. The Fact Extraction and VERification (FEVER) dataset provides such a resource for evaluating endto- end fact-checking, requiring retrieval of evidence from Wikipedia to validate a veracity prediction. We show that current systems for FEVER are vulnerable to three categories of realistic challenges for fact-checking – multiple propositions, temporal reasoning, and ambiguity and lexical variation – and introduce a resource with these types of claims. Then we present a system designed to be resilient to these “attacks” using multiple pointer networks for document selection and jointly modeling a sequence of evidence sentences and veracity relation predictions. We find that in handling these attacks we obtain state-of-the-art results on FEVER, largely due to improved evidence retrieval.

pdf bib
Detecting Urgency Status of Crisis Tweets: A Transfer Learning Approach for Low Resource Languages
Efsun Sarioglu Kayi | Linyong Nan | Bohan Qu | Mona Diab | Kathleen McKeown
Proceedings of the 28th International Conference on Computational Linguistics

We release an urgency dataset that consists of English tweets relating to natural crises, along with annotations of their corresponding urgency status. Additionally, we release evaluation datasets for two low-resource languages, i.e. Sinhala and Odia, and demonstrate an effective zero-shot transfer from English to these two languages by training cross-lingual classifiers. We adopt cross-lingual embeddings constructed using different methods to extract features of the tweets, including a few state-of-the-art contextual embeddings such as BERT, RoBERTa and XLM-R. We train classifiers of different architectures on the extracted features. We also explore semi-supervised approaches by utilizing unlabeled tweets and experiment with ensembling different classifiers. With very limited amounts of labeled data in English and zero data in the low resource languages, we show a successful framework of training monolingual and cross-lingual classifiers using deep learning methods which are known to be data hungry. Specifically, we show that the recent deep contextual embeddings are also helpful when dealing with very small-scale datasets. Classifiers that incorporate RoBERTa yield the best performance for English urgency detection task, with F1 scores that are more than 25 points over our baseline classifier. For the zero-shot transfer to low resource languages, classifiers that use LASER features perform the best for Sinhala transfer while XLM-R features benefit the Odia transfer the most.

pdf bib
Multitask Learning for Cross-Lingual Transfer of Broad-coverage Semantic Dependencies
Maryam Aminian | Mohammad Sadegh Rasooli | Mona Diab
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We describe a method for developing broad-coverage semantic dependency parsers for languages for which no semantically annotated resource is available. We leverage a multitask learning framework coupled with annotation projection. We use syntactic parsing as the auxiliary task in our multitask setup. Our annotation projection experiments from English to Czech show that our multitask setup yields 3.1% (4.2%) improvement in labeled F1-score on in-domain (out-of-domain) test set compared to a single-task baseline.

pdf bib
Learning to Classify Intents and Slot Labels Given a Handful of Examples
Jason Krone | Yi Zhang | Mona Diab
Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI

Intent classification (IC) and slot filling (SF) are core components in most goal-oriented dialogue systems. Current IC/SF models perform poorly when the number of training examples per class is small. We propose a new few-shot learning task, few-shot IC/SF, to study and improve the performance of IC and SF models on classes not seen at training time in ultra low resource scenarios. We establish a few-shot IC/SF benchmark by defining few-shot splits for three public IC/SF datasets, ATIS, TOP, and Snips. We show that two popular few-shot learning algorithms, model agnostic meta learning (MAML) and prototypical networks, outperform a fine-tuning baseline on this benchmark. Prototypical networks achieves significant gains in IC performance on the ATIS and TOP datasets, while both prototypical networks and MAML outperform the baseline with respect to SF on all three datasets. In addition, we demonstrate that joint training as well as the use of pre-trained language models, ELMo and BERT in our case, are complementary to these few-shot learning methods and yield further gains.

pdf bib
Diversity, Density, and Homogeneity: Quantitative Characteristic Metrics for Text Collections
Yi-An Lai | Xuan Zhu | Yi Zhang | Mona Diab
Proceedings of the 12th Language Resources and Evaluation Conference

Summarizing data samples by quantitative measures has a long history, with descriptive statistics being a case in point. However, as natural language processing methods flourish, there are still insufficient characteristic metrics to describe a collection of texts in terms of the words, sentences, or paragraphs they comprise. In this work, we propose metrics of diversity, density, and homogeneity that quantitatively measure the dispersion, sparsity, and uniformity of a text collection. We conduct a series of simulations to verify that each metric holds desired properties and resonates with human intuitions. Experiments on real-world datasets demonstrate that the proposed characteristic metrics are highly correlated with text classification performance of a renowned model, BERT, which could inspire future applications.

2019

pdf bib
Scalable Cross-Lingual Transfer of Neural Sentence Embeddings
Hanan Aldarmaki | Mona Diab
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

We develop and investigate several cross-lingual alignment approaches for neural sentence embedding models, such as the supervised inference classifier, InferSent, and sequential encoder-decoder models. We evaluate three alignment frameworks applied to these models: joint modeling, representation transfer learning, and sentence mapping, using parallel text to guide the alignment. Our results support representation transfer as a scalable approach for modular cross-lingual alignment of neural sentence embeddings, where we observe better performance compared to joint models in intrinsic and extrinsic evaluations, particularly with smaller sets of parallel data.

pdf bib
GWU NLP Lab at SemEval-2019 Task 3 : EmoContext: Effectiveness ofContextual Information in Models for Emotion Detection inSentence-level at Multi-genre Corpus
Shabnam Tafreshi | Mona Diab
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper we present an emotion classifier models that submitted to the SemEval-2019 Task 3 : EmoContext. Our approach is a Gated Recurrent Neural Network (GRU) model with attention layer is bootstrapped with contextual information and trained with a multigenre corpus, which is combination of several popular emotional data sets. We utilize different word embeddings to empirically select the most suited embedding to represent our features. Our aim is to build a robust emotion classifier that can generalize emotion detection, which is to learn emotion cues in a noisy training environment. To fulfill this aim we train our model with a multigenre emotion corpus, this way we leverage from having more training set. We achieved overall %56.05 f1-score and placed 144. Given our aim and noisy training environment, the results are anticipated.

pdf bib
GWU NLP at SemEval-2019 Task 7: Hybrid Pipeline for Rumour Veracity and Stance Classification on Social Media
Sardar Hamidian | Mona Diab
Proceedings of the 13th International Workshop on Semantic Evaluation

Social media plays a crucial role as the main resource news for information seekers online. However, the unmoderated feature of social media platforms lead to the emergence and spread of untrustworthy contents which harm individuals or even societies. Most of the current automated approaches for automatically determining the veracity of a rumor are not generalizable for novel emerging topics. This paper describes our hybrid system comprising rules and a machine learning model which makes use of replied tweets to identify the veracity of the source tweet. The proposed system in this paper achieved 0.435 F-Macro in stance classification, and 0.262 F-macro and 0.801 RMSE in rumor verification tasks in Task7 of SemEval 2019.

pdf bib
CASA-NLU: Context-Aware Self-Attentive Natural Language Understanding for Task-Oriented Chatbots
Arshit Gupta | Peng Zhang | Garima Lalwani | Mona Diab
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Natural Language Understanding (NLU) is a core component of dialog systems. It typically involves two tasks - Intent Classification (IC) and Slot Labeling (SL), which are then followed by a dialogue management (DM) component. Such NLU systems cater to utterances in isolation, thus pushing the problem of context management to DM. However, contextual information is critical to the correct prediction of intents in a conversation. Prior work on contextual NLU has been limited in terms of the types of contextual signals used and the understanding of their impact on the model. In this work, we propose a context-aware self-attentive NLU (CASA-NLU) model that uses multiple signals over a variable context window, such as previous intents, slots, dialog acts and utterances, in addition to the current user utterance. CASA-NLU outperforms a recurrent contextual NLU baseline on two conversational datasets, yielding a gain of up to 7% on the IC task. Moreover, a non-contextual variant of CASA-NLU achieves state-of-the-art performance on standard public datasets - SNIPS and ATIS.

pdf bib
Efficient Convolutional Neural Networks for Diacritic Restoration
Sawsan Alqahtani | Ajay Mishra | Mona Diab
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Diacritic restoration has gained importance with the growing need for machines to understand written texts. The task is typically modeled as a sequence labeling problem and currently Bidirectional Long Short Term Memory (BiLSTM) models provide state-of-the-art results. Recently, Bai et al. (2018) show the advantages of Temporal Convolutional Neural Networks (TCN) over Recurrent Neural Networks (RNN) for sequence modeling in terms of performance and computational resources. As diacritic restoration benefits from both previous as well as subsequent timesteps, we further apply and evaluate a variant of TCN, Acausal TCN (A-TCN), which incorporates context from both directions (previous and future) rather than strictly incorporating previous context as in the case of TCN. A-TCN yields significant improvement over TCN for diacritization in three different languages: Arabic, Yoruba, and Vietnamese. Furthermore, A-TCN and BiLSTM have comparable performance, making A-TCN an efficient alternative over BiLSTM since convolutions can be trained in parallel. A-TCN is significantly faster than BiLSTM at inference time (270% 334% improvement in the amount of text diacritized per minute).

pdf bib
Efficient Sentence Embedding using Discrete Cosine Transform
Nada Almarwani | Hanan Aldarmaki | Mona Diab
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Vector averaging remains one of the most popular sentence embedding methods in spite of its obvious disregard for syntactic structure. While more complex sequential or convolutional networks potentially yield superior classification performance, the improvements in classification accuracy are typically mediocre compared to the simple vector averaging. As an efficient alternative, we propose the use of discrete cosine transform (DCT) to compress word sequences in an order-preserving manner. The lower order DCT coefficients represent the overall feature patterns in sentences, which results in suitable embeddings for tasks that could benefit from syntactic features. Our results in semantic probing tasks demonstrate that DCT embeddings indeed preserve more syntactic information compared with vector averaging. With practically equivalent complexity, the model yields better overall performance in downstream classification tasks that correlate with syntactic features, which illustrates the capacity of DCT to preserve word order information.

pdf bib
Multi-Domain Goal-Oriented Dialogues (MultiDoGO): Strategies toward Curating and Annotating Large Scale Dialogue Data
Denis Peskov | Nancy Clarke | Jason Krone | Brigi Fodor | Yi Zhang | Adel Youssef | Mona Diab
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

The need for high-quality, large-scale, goal-oriented dialogue datasets continues to grow as virtual assistants become increasingly wide-spread. However, publicly available datasets useful for this area are limited either in their size, linguistic diversity, domain coverage, or annotation granularity. In this paper, we present strategies toward curating and annotating large scale goal oriented dialogue data. We introduce the MultiDoGO dataset to overcome these limitations. With a total of over 81K dialogues harvested across six domains, MultiDoGO is over 8 times the size of MultiWOZ, the other largest comparable dialogue dataset currently available to the public. Over 54K of these harvested conversations are annotated for intent classes and slot labels. We adopt a Wizard-of-Oz approach wherein a crowd-sourced worker (the “customer”) is paired with a trained annotator (the “agent”). The data curation process was controlled via biases to ensure a diversity in dialogue flows following variable dialogue policies. We provide distinct class label tags for agents vs. customer utterances, along with applicable slot labels. We also compare and contrast our strategies on annotation granularity, i.e. turn vs. sentence level. Furthermore, we compare and contrast annotations curated by leveraging professional annotators vs the crowd. We believe our strategies for eliciting and annotating such a dialogue dataset scales across modalities and domains and potentially languages in the future. To demonstrate the efficacy of our devised strategies we establish neural baselines for classification on the agent and customer utterances as well as slot labeling for each domain.

pdf bib
Identifying Nuances in Fake News vs. Satire: Using Semantic and Linguistic Cues
Or Levi | Pedram Hosseini | Mona Diab | David Broniatowski
Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda

The blurry line between nefarious fake news and protected-speech satire has been a notorious struggle for social media platforms. Further to the efforts of reducing exposure to misinformation on social media, purveyors of fake news have begun to masquerade as satire sites to avoid being demoted. In this work, we address the challenge of automatically classifying fake news versus satire. Previous work have studied whether fake news and satire can be distinguished based on language differences. Contrary to fake news, satire stories are usually humorous and carry some political or social message. We hypothesize that these nuances could be identified using semantic and linguistic cues. Consequently, we train a machine learning method using semantic representation, with a state-of-the-art contextual language model, and with linguistic features based on textual coherence metrics. Empirical evaluation attests to the merits of our approach compared to the language-based baseline and sheds light on the nuances between fake news and satire. As avenues for future work, we consider studying additional linguistic features related to the humor aspect, and enriching the data with current news events, to help identify a political or social message.

pdf bib
Cross-Lingual Transfer of Semantic Roles: From Raw Text to Semantic Roles
Maryam Aminian | Mohammad Sadegh Rasooli | Mona Diab
Proceedings of the 13th International Conference on Computational Semantics - Long Papers

We describe a transfer method based on annotation projection to develop a dependency-based semantic role labeling system for languages for which no supervised linguistic information other than parallel data is available. Unlike previous work that presumes the availability of supervised features such as lemmas, part-of-speech tags, and dependency parse trees, we only make use of word and character features. Our deep model considers using character-based representations as well as unsupervised stem embeddings to alleviate the need for supervised features. Our experiments outperform a state-of-the-art method that uses supervised lexico-syntactic features on 6 out of 7 languages in the Universal Proposition Bank.

pdf bib
Leveraging Pretrained Word Embeddings for Part-of-Speech Tagging of Code Switching Data
Fahad AlGhamdi | Mona Diab
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

Linguistic Code Switching (CS) is a phenomenon that occurs when multilingual speakers alternate between two or more languages/dialects within a single conversation. Processing CS data is especially challenging in intra-sentential data given state-of-the-art monolingual NLP technologies since such technologies are geared toward the processing of one language at a time. In this paper, we address the problem of Part-of-Speech tagging (POS) in the context of linguistic code switching (CS). We explore leveraging multiple neural network architectures to measure the impact of different pre-trained embeddings methods on POS tagging CS data. We investigate the landscape in four CS language pairs, Spanish-English, Hindi-English, Modern Standard Arabic- Egyptian Arabic dialect (MSA-EGY), and Modern Standard Arabic- Levantine Arabic dialect (MSA-LEV). Our results show that multilingual embedding (e.g., MSA-EGY and MSA-LEV) helps closely related languages (EGY/LEV) but adds noise to the languages that are distant (SPA/HIN). Finally, we show that our proposed models outperform state-of-the-art CS taggers for MSA-EGY language pair.

pdf bib
Homograph Disambiguation through Selective Diacritic Restoration
Sawsan Alqahtani | Hanan Aldarmaki | Mona Diab
Proceedings of the Fourth Arabic Natural Language Processing Workshop

Lexical ambiguity, a challenging phenomenon in all natural languages, is particularly prevalent for languages with diacritics that tend to be omitted in writing, such as Arabic. Omitting diacritics leads to an increase in the number of homographs: different words with the same spelling. Diacritic restoration could theoretically help disambiguate these words, but in practice, the increase in overall sparsity leads to performance degradation in NLP applications. In this paper, we propose approaches for automatically marking a subset of words for diacritic restoration, which leads to selective homograph disambiguation. Compared to full or no diacritic restoration, these approaches yield selectively-diacritized datasets that balance sparsity and lexical disambiguation. We evaluate the various selection strategies extrinsically on several downstream applications: neural machine translation, part-of-speech tagging, and semantic textual similarity. Our experiments on Arabic show promising results, where our devised strategies on selective diacritization lead to a more balanced and consistent performance in downstream applications.

pdf bib
Context-Aware Cross-Lingual Mapping
Hanan Aldarmaki | Mona Diab
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Cross-lingual word vectors are typically obtained by fitting an orthogonal matrix that maps the entries of a bilingual dictionary from a source to a target vector space. Word vectors, however, are most commonly used for sentence or document-level representations that are calculated as the weighted average of word embeddings. In this paper, we propose an alternative to word-level mapping that better reflects sentence-level cross-lingual similarity. We incorporate context in the transformation matrix by directly mapping the averaged embeddings of aligned sentences in a parallel corpus. We also implement cross-lingual mapping of deep contextualized word embeddings using parallel sentences with word alignments. In our experiments, both approaches resulted in cross-lingual sentence embeddings that outperformed context-independent word mapping in sentence translation retrieval. Furthermore, the sentence-level transformation could be used for word-level mapping without loss in word translation quality.

2018

pdf bib
WASA: A Web Application for Sequence Annotation
Fahad AlGhamdi | Mona Diab
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Sentence and Clause Level Emotion Annotation, Detection, and Classification in a Multi-Genre Corpus
Shabnam Tafreshi | Mona Diab
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Unsupervised Word Mapping Using Structural Similarities in Monolingual Embeddings
Hanan Aldarmaki | Mahesh Mohan | Mona Diab
Transactions of the Association for Computational Linguistics, Volume 6

Most existing methods for automatic bilingual dictionary induction rely on prior alignments between the source and target languages, such as parallel corpora or seed dictionaries. For many language pairs, such supervised alignments are not readily available. We propose an unsupervised approach for learning a bilingual dictionary for a pair of languages given their independently-learned monolingual word embeddings. The proposed method exploits local and global structures in monolingual vector spaces to align them such that similar words are mapped to each other. We show empirically that the performance of bilingual correspondents that are learned using our proposed unsupervised method is comparable to that of using supervised bilingual correspondents from a seed dictionary.

pdf bib
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching
Gustavo Aguilar | Fahad AlGhamdi | Victor Soto | Thamar Solorio | Mona Diab | Julia Hirschberg
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

pdf bib
Named Entity Recognition on Code-Switched Data: Overview of the CALCS 2018 Shared Task
Gustavo Aguilar | Fahad AlGhamdi | Victor Soto | Mona Diab | Julia Hirschberg | Thamar Solorio
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

In the third shared task of the Computational Approaches to Linguistic Code-Switching (CALCS) workshop, we focus on Named Entity Recognition (NER) on code-switched social-media data. We divide the shared task into two competitions based on the English-Spanish (ENG-SPA) and Modern Standard Arabic-Egyptian (MSA-EGY) language pairs. We use Twitter data and 9 entity types to establish a new dataset for code-switched NER benchmarks. In addition to the CS phenomenon, the diversity of the entities and the social media challenges make the task considerably hard to process. As a result, the best scores of the competitions are 63.76% and 71.61% for ENG-SPA and MSA-EGY, respectively. We present the scores of 9 participants and discuss the most common challenges among submissions.

pdf bib
Team SWEEPer: Joint Sentence Extraction and Fact Checking with Pointer Networks
Christopher Hidey | Mona Diab
Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)

Many tasks such as question answering and reading comprehension rely on information extracted from unreliable sources. These systems would thus benefit from knowing whether a statement from an unreliable source is correct. We present experiments on the FEVER (Fact Extraction and VERification) task, a shared task that involves selecting sentences from Wikipedia and predicting whether a claim is supported by those sentences, refuted, or there is not enough information. Fact checking is a task that benefits from not only asserting or disputing the veracity of a claim but also finding evidence for that position. As these tasks are dependent on each other, an ideal model would consider the veracity of the claim when finding evidence and also find only the evidence that is relevant. We thus jointly model sentence extraction and verification on the FEVER shared task. Among all participants, we ranked 5th on the blind test set (prior to any additional human evaluation of the evidence).

pdf bib
Evaluation of Unsupervised Compositional Representations
Hanan Aldarmaki | Mona Diab
Proceedings of the 27th International Conference on Computational Linguistics

We evaluated various compositional models, from bag-of-words representations to compositional RNN-based models, on several extrinsic supervised and unsupervised evaluation benchmarks. Our results confirm that weighted vector averaging can outperform context-sensitive models in most benchmarks, but structural features encoded in RNN models can also be useful in certain classification tasks. We analyzed some of the evaluation datasets to identify the aspects of meaning they measure and the characteristics of the various models that explain their performance variance.

pdf bib
Emotion Detection and Classification in a Multigenre Corpus with Joint Multi-Task Deep Learning
Shabnam Tafreshi | Mona Diab
Proceedings of the 27th International Conference on Computational Linguistics

Detection and classification of emotion categories expressed by a sentence is a challenging task due to subjectivity of emotion. To date, most of the models are trained and evaluated on single genre and when used to predict emotion in different genre their performance drops by a large margin. To address the issue of robustness, we model the problem within a joint multi-task learning framework. We train this model with a multigenre emotion corpus to predict emotions across various genre. Each genre is represented as a separate task, we use soft parameter shared layers across the various tasks. our experimental results show that this model improves the results across the various genres, compared to a single genre training in the same neural net architecture.

2017

pdf bib
Transferring Semantic Roles Using Translation and Syntactic Information
Maryam Aminian | Mohammad Sadegh Rasooli | Mona Diab
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Our paper addresses the problem of annotation projection for semantic role labeling for resource-poor languages using supervised annotations from a resource-rich language through parallel data. We propose a transfer method that employs information from source and target syntactic dependencies as well as word alignment density to improve the quality of an iterative bootstrapping method. Our experiments yield a 3.5 absolute labeled F-score improvement over a standard annotation projection method.

pdf bib
Proceedings of the Third Arabic Natural Language Processing Workshop
Nizar Habash | Mona Diab | Kareem Darwish | Wassim El-Hajj | Hend Al-Khalifa | Houda Bouamor | Nadi Tomeh | Mahmoud El-Haj | Wajdi Zaghouani
Proceedings of the Third Arabic Natural Language Processing Workshop

pdf bib
A Layered Language Model based Hybrid Approach to Automatic Full Diacritization of Arabic
Mohamed Al-Badrashiny | Abdelati Hawwari | Mona Diab
Proceedings of the Third Arabic Natural Language Processing Workshop

In this paper we present a system for automatic Arabic text diacritization using three levels of analysis granularity in a layered back off manner. We build and exploit diacritized language models (LM) for each of three different levels of granularity: surface form, morphologically segmented into prefix/stem/suffix, and character level. For each of the passes, we use Viterbi search to pick the most probable diacritization per word in the input. We start with the surface form LM, followed by the morphological level, then finally we leverage the character level LM. Our system outperforms all of the published systems evaluated against the same training and test data. It achieves a 10.87% WER for complete full diacritization including lexical and syntactic diacritization, and 3.0% WER for lexical diacritization, ignoring syntactic diacritization.

pdf bib
Arabic Textual Entailment with Word Embeddings
Nada Almarwani | Mona Diab
Proceedings of the Third Arabic Natural Language Processing Workshop

Determining the textual entailment between texts is important in many NLP tasks, such as summarization, question answering, and information extraction and retrieval. Various methods have been suggested based on external knowledge sources; however, such resources are not always available in all languages and their acquisition is typically laborious and very costly. Distributional word representations such as word embeddings learned over large corpora have been shown to capture syntactic and semantic word relationships. Such models have contributed to improving the performance of several NLP tasks. In this paper, we address the problem of textual entailment in Arabic. We employ both traditional features and distributional representations. Crucially, we do not depend on any external resources in the process. Our suggested approach yields state of the art performance on a standard data set, ArbTE, achieving an accuracy of 76.2 % compared to state of the art of 69.3 %.

pdf bib
Predictive Linguistic Features of Schizophrenia
Efsun Sarioglu Kayi | Mona Diab | Luca Pauselli | Michael Compton | Glen Coppersmith
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)

Schizophrenia is one of the most disabling and difficult to treat of all human medical/health conditions, ranking in the top ten causes of disability worldwide. It has been a puzzle in part due to difficulty in identifying its basic, fundamental components. Several studies have shown that some manifestations of schizophrenia (e.g., the negative symptoms that include blunting of speech prosody, as well as the disorganization symptoms that lead to disordered language) can be understood from the perspective of linguistics. However, schizophrenia research has not kept pace with technologies in computational linguistics, especially in semantics and pragmatics. As such, we examine the writings of schizophrenia patients analyzing their syntax, semantics and pragmatics. In addition, we analyze tweets of (self proclaimed) schizophrenia patients who publicly discuss their diagnoses. For writing samples dataset, syntactic features are found to be the most successful in classification whereas for the less structured Twitter dataset, a combination of features performed the best.

pdf bib
SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation
Daniel Cer | Mona Diab | Eneko Agirre | Iñigo Lopez-Gazpio | Lucia Specia
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

Semantic Textual Similarity (STS) measures the meaning similarity of sentences. Applications include machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. The STS shared task is a venue for assessing the current state-of-the-art. The 2017 task focuses on multilingual and cross-lingual pairs with one sub-track exploring MT quality estimation (MTQE) data. The task obtained strong participation from 31 teams, with 17 participating in all language tracks. We summarize performance and review a selection of well performing methods. Analysis highlights common errors, providing insight into the limitations of existing models. To support ongoing work on semantic representations, the STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017).

pdf bib
GW_QA at SemEval-2017 Task 3: Question Answer Re-ranking on Arabic Fora
Nada Almarwani | Mona Diab
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes our submission to SemEval-2017 Task 3 Subtask D, “Question Answer Ranking in Arabic Community Question Answering”. In this work, we applied a supervised machine learning approach to automatically re-rank a set of QA pairs according to their relevance to a given question. We employ features based on latent semantic models, namely WTMF, as well as a set of lexical features based on string lengths and surface level matching. The proposed system ranked first out of 3 submissions, with a MAP score of 61.16%.

2016

pdf bib
Rumor Identification and Belief Investigation on Twitter
Sardar Hamidian | Mona Diab
Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

pdf bib
Learning Cross-lingual Representations with Matrix Factorization
Hanan Aldarmaki | Mona Diab
Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP

pdf bib
Addressing Annotation Complexity: The Case of Annotating Ideological Perspective in Egyptian Social Media
Heba Elfardy | Mona Diab
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

pdf bib
Using Ambiguity Detection to Streamline Linguistic Annotation
Wajdi Zaghouani | Abdelati Hawwari | Sawsan Alqahtani | Houda Bouamor | Mahmoud Ghoneim | Mona Diab | Kemal Oflazer
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

Arabic writing is typically underspecified for short vowels and other markups, referred to as diacritics. In addition to the lexical ambiguity exhibited in most languages, the lack of diacritics in written Arabic adds another layer of ambiguity which is an artifact of the orthography. In this paper, we present the details of three annotation experimental conditions designed to study the impact of automatic ambiguity detection, on annotation speed and quality in a large scale annotation project.

pdf bib
The GW/LT3 VarDial 2016 Shared Task System for Dialects and Similar Languages Detection
Ayah Zirikly | Bart Desmet | Mona Diab
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

This paper describes the GW/LT3 contribution to the 2016 VarDial shared task on the identification of similar languages (task 1) and Arabic dialects (task 2). For both tasks, we experimented with Logistic Regression and Neural Network classifiers in isolation. Additionally, we implemented a cascaded classifier that consists of coarse and fine-grained classifiers (task 1) and a classifier ensemble with majority voting for task 2. The submitted systems obtained state-of-the art performance and ranked first for the evaluation on social media data (test sets B1 and B2 for task 1), with a maximum weighted F1 score of 91.94%.

pdf bib
Processing Dialectal Arabic: Exploiting Variability and Similarity to Overcome Challenges and Discover Opportunities
Mona Diab
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

We recently witnessed an exponential growth in dialectal Arabic usage in both textual data and speech recordings especially in social media. Processing such media is of great utility for all kinds of applications ranging from information extraction to social media analytics for political and commercial purposes to building decision support systems. Compared to other languages, Arabic, especially the informal variety, poses a significant challenge to natural language processing algorithms since it comprises multiple dialects, linguistic code switching, and a lack of standardized orthographies, to top its relatively complex morphology. Inherently, the problem of processing Arabic in the context of social media is the problem of how to handle resource poor languages. In this talk I will go over some of our insights to some of these problems and show how there is a silver lining where we can generalize some of our solutions to other low resource language contexts.

pdf bib
Automatic Verification and Augmentation of Multilingual Lexicons
Maryam Aminian | Mohamed Al-Badrashiny | Mona Diab
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

We present an approach for automatic verification and augmentation of multilingual lexica. We exploit existing parallel and monolingual corpora to extract multilingual correspondents via tri-angulation. We demonstrate the efficacy of our approach on two publicly available resources: Tharwa, a three-way lexicon comprising Dialectal Arabic, Modern Standard Arabic and English lemmas among other information (Diab et al., 2014); and BabelNet, a multilingual thesaurus comprising over 276 languages including Arabic variant entries (Navigli and Ponzetto, 2012). Our automated approach yields an F1-score of 71.71% in generating correct multilingual correspondents against gold Tharwa, and 54.46% against gold BabelNet without any human intervention.

pdf bib
The Power of Language Music: Arabic Lemmatization through Patterns
Mohammed Attia | Ayah Zirikly | Mona Diab
Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex - V)

The interaction between roots and patterns in Arabic has intrigued lexicographers and morphologists for centuries. While roots provide the consonantal building blocks, patterns provide the syllabic vocalic moulds. While roots provide abstract semantic classes, patterns realize these classes in specific instances. In this way both roots and patterns are indispensable for understanding the derivational, morphological and, to some extent, the cognitive aspects of the Arabic language. In this paper we perform lemmatization (a high-level lexical processing) without relying on a lookup dictionary. We use a hybrid approach that consists of a machine learning classifier to predict the lemma pattern for a given stem, and mapping rules to convert stems to their respective lemmas with the vocalization defined by the pattern.

pdf bib
SAMER: A Semi-Automatically Created Lexical Resource for Arabic Verbal Multiword Expressions Tokens Paradigm and their Morphosyntactic Features
Mohamed Al-Badrashiny | Abdelati Hawwari | Mahmoud Ghoneim | Mona Diab
Proceedings of the 12th Workshop on Asian Language Resources (ALR12)

Although MWE are relatively morphologically and syntactically fixed expressions, several types of flexibility can be observed in MWE, verbal MWE in particular. Identifying the degree of morphological and syntactic flexibility of MWE is very important for many Lexicographic and NLP tasks. Adding MWE variants/tokens to a dictionary resource requires characterizing the flexibility among other morphosyntactic features. Carrying out the task manually faces several challenges since it is a very laborious task time and effort wise, as well as it will suffer from coverage limitation. The problem is exacerbated in rich morphological languages where the average word in Arabic could have 12 possible inflection forms. Accordingly, in this paper we introduce a semi-automatic Arabic multiwords expressions resource (SAMER). We propose an automated method that identifies the morphological and syntactic flexibility of Arabic Verbal Multiword Expressions (AVMWE). All observed morphological variants and syntactic pattern alternations of an AVMWE are automatically acquired using large scale corpora. We look for three morphosyntactic aspects of AVMWE types investigating derivational and inflectional variations and syntactic templates, namely: 1) inflectional variation (inflectional paradigm) and calculating degree of flexibility; 2) derivational productivity; and 3) identifying and classifying the different syntactic types. We build a comprehensive list of AVMWE. Every token in the AVMWE list is lemmatized and tagged with POS information. We then search Arabic Gigaword and All ATBs for all possible flexible matches. For each AVMWE type we generate: a) a statistically ranked list of MWE-lexeme inflections and syntactic pattern alternations; b) An abstract syntactic template; and c) The most frequent form. Our technique is validated using a Golden MWE annotated list. The results shows that the quality of the generated resource is 80.04%.

pdf bib
Proceedings of the Second Workshop on Computational Approaches to Code Switching
Mona Diab | Pascale Fung | Mahmoud Ghoneim | Julia Hirschberg | Thamar Solorio
Proceedings of the Second Workshop on Computational Approaches to Code Switching

pdf bib
Overview for the Second Shared Task on Language Identification in Code-Switched Data
Giovanni Molina | Fahad AlGhamdi | Mahmoud Ghoneim | Abdelati Hawwari | Nicolas Rey-Villamizar | Mona Diab | Thamar Solorio
Proceedings of the Second Workshop on Computational Approaches to Code Switching

pdf bib
Part of Speech Tagging for Code Switched Data
Fahad AlGhamdi | Giovanni Molina | Mona Diab | Thamar Solorio | Abdelati Hawwari | Victor Soto | Julia Hirschberg
Proceedings of the Second Workshop on Computational Approaches to Code Switching

pdf bib
The George Washington University System for the Code-Switching Workshop Shared Task 2016
Mohamed Al-Badrashiny | Mona Diab
Proceedings of the Second Workshop on Computational Approaches to Code Switching

pdf bib
Investigating the Impact of Various Partial Diacritization Schemes on Arabic-English Statistical Machine Translation
Sawsan Alqahtani | Mahmoud Ghoneim | Mona Diab
Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track

Most diacritics in Arabic represent short vowels. In Arabic orthography, such diacritics are considered optional. The absence of these diacritics naturally leads to significant word ambiguity to top the inherent ambiguity present in fully diacritized words. Word ambiguity is a significant impediment for machine translation. Despite the ambiguity presented by lack of diacritization, context helps ameliorate the situation. Identifying the appropriate amount of diacritic restoration to reduce word sense ambiguity in the context of machine translation is the object of this paper. Diacritic marks help reduce the number of possible lexical word choices assigned to a source word which leads to better quality translated sentences. We investigate a variety of (linguistically motivated) partial diacritization schemes that preserve some of the semantics that in essence complement the implicit contextual information present in the sentences. We also study the effect of training data size and report results on three standard test sets that represent a combination of different genres. The results show statistically significant improvements for some schemes compared to two baselines: text with no diacritics (the typical writing system adopted for Arabic) and text that is fully diacritized.

pdf bib
CU-GWU Perspective at SemEval-2016 Task 6: Ideological Stance Detection in Informal Text
Heba Elfardy | Mona Diab
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib
SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation
Eneko Agirre | Carmen Banea | Daniel Cer | Mona Diab | Aitor Gonzalez-Agirre | Rada Mihalcea | German Rigau | Janyce Wiebe
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib
GWU NLP at SemEval-2016 Shared Task 1: Matrix Factorization for Crosslingual STS
Hanan Aldarmaki | Mona Diab
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib
LILI: A Simple Language Independent Approach for Language Identification
Mohamed Al-Badrashiny | Mona Diab
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We introduce a generic Language Independent Framework for Linguistic Code Switch Point Detection. The system uses characters level 5-grams and word level unigram language models to train a conditional random fields (CRF) model for classifying input words into various languages. We test our proposed framework and compare it to the state-of-the-art published systems on standard data sets from several language pairs: English-Spanish, Nepali-English, English-Hindi, Arabizi (Refers to Arabic written using the Latin/Roman script)-English, Arabic-Engari (Refers to English written using Arabic script), Modern Standard Arabic(MSA)-Egyptian, Levantine-MSA, Gulf-MSA, one more English-Spanish, and one more MSA-EGY. The overall weighted average F-score of each language pair are 96.4%, 97.3%, 98.0%, 97.0%, 98.9%, 86.3%, 88.2%, 90.6%, 95.2%, and 85.0% respectively. The results show that our approach despite its simplicity, either outperforms or performs at comparable levels to state-of-the-art published systems.

pdf bib
Explicit Fine grained Syntactic and Semantic Annotation of the Idafa Construction in Arabic
Abdelati Hawwari | Mohammed Attia | Mahmoud Ghoneim | Mona Diab
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Idafa in traditional Arabic grammar is an umbrella construction that covers several phenomena including what is expressed in English as noun-noun compounds and Saxon and Norman genitives. Additionally, Idafa participates in some other constructions, such as quantifiers, quasi-prepositions, and adjectives. Identifying the various types of the Idafa construction (IC) is of importance to Natural Language processing (NLP) applications. Noun-Noun compounds exhibit special behavior in most languages impacting their semantic interpretation. Hence distinguishing them could have an impact on downstream NLP applications. The most comprehensive syntactic representation of the Arabic language is the LDC Arabic Treebank (ATB). In the ATB, ICs are not explicitly labeled and furthermore, there is no distinction between ICs of noun-noun relations and other traditional ICs. Hence, we devise a detailed syntactic and semantic typification process of the IC phenomenon in Arabic. We target the ATB as a platform for this classification. We render the ATB annotated with explicit IC labels but with the further semantic characterization which is useful for syntactic, semantic and cross language processing. Our typification of IC comprises 3 main syntactic IC types: FIC, GIC, and TIC, and they are further divided into 10 syntactic subclasses. The TIC group is further classified into semantic relations. We devise a method for automatic IC labeling and compare its yield against the CATiB treebank. Our evaluation shows that we achieve the same level of accuracy, but with the additional fine-grained classification into the various syntactic and semantic types.

pdf bib
Guidelines and Framework for a Large Scale Arabic Diacritized Corpus
Wajdi Zaghouani | Houda Bouamor | Abdelati Hawwari | Mona Diab | Ossama Obeid | Mahmoud Ghoneim | Sawsan Alqahtani | Kemal Oflazer
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents the annotation guidelines developed as part of an effort to create a large scale manually diacritized corpus for various Arabic text genres. The target size of the annotated corpus is 2 million words. We summarize the guidelines and describe issues encountered during the training of the annotators. We also discuss the challenges posed by the complexity of the Arabic language and how they are addressed. Finally, we present the diacritization annotation procedure and detail the quality of the resulting annotations.

pdf bib
SPLIT: Smart Preprocessing (Quasi) Language Independent Tool
Mohamed Al-Badrashiny | Arfath Pasha | Mona Diab | Nizar Habash | Owen Rambow | Wael Salloum | Ramy Eskander
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Text preprocessing is an important and necessary task for all NLP applications. A simple variation in any preprocessing step may drastically affect the final results. Moreover replicability and comparability, as much as feasible, is one of the goals of our scientific enterprise, thus building systems that can ensure the consistency in our various pipelines would contribute significantly to our goals. The problem has become quite pronounced with the abundance of NLP tools becoming more and more available yet with different levels of specifications. In this paper, we present a dynamic unified preprocessing framework and tool, SPLIT, that is highly configurable based on user requirements which serves as a preprocessing tool for several tools at once. SPLIT aims to standardize the implementations of the most important preprocessing steps by allowing for a unified API that could be exchanged across different researchers to ensure complete transparency in replication. The user is able to select the required preprocessing tasks among a long list of preprocessing steps. The user is also able to specify the order of execution which in turn affects the final preprocessing output.

pdf bib
Creating a Large Multi-Layered Representational Repository of Linguistic Code Switched Arabic Data
Mona Diab | Mahmoud Ghoneim | Abdelati Hawwari | Fahad AlGhamdi | Nada AlMarwani | Mohamed Al-Badrashiny
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present our effort to create a large Multi-Layered representational repository of Linguistic Code-Switched Arabic data. The process involves developing clear annotation standards and Guidelines, streamlining the annotation process, and implementing quality control measures. We used two main protocols for annotation: in-lab gold annotations and crowd sourcing annotations. We developed a web-based annotation tool to facilitate the management of the annotation process. The current version of the repository contains a total of 886,252 tokens that are tagged into one of sixteen code-switching tags. The data exhibits code switching between Modern Standard Arabic and Egyptian Dialectal Arabic representing three data genres: Tweets, commentaries, and discussion fora. The overall Inter-Annotator Agreement is 93.1%.

2015

pdf bib
AIDA2: A Hybrid Approach for Token and Sentence Level Dialect Identification in Arabic
Mohamed Al-Badrashiny | Heba Elfardy | Mona Diab
Proceedings of the Nineteenth Conference on Computational Natural Language Learning

pdf bib
A New Dataset and Evaluation for Belief/Factuality
Vinodkumar Prabhakaran | Tomas By | Julia Hirschberg | Owen Rambow | Samira Shaikh | Tomek Strzalkowski | Jennifer Tracey | Michael Arrigo | Rupayan Basu | Micah Clark | Adam Dalton | Mona Diab | Louise Guthrie | Anna Prokofieva | Stephanie Strassel | Gregory Werner | Yorick Wilks | Janyce Wiebe
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

pdf bib
Ideological Perspective Detection Using Semantic Features
Heba Elfardy | Mona Diab | Chris Callison-Burch
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

pdf bib
SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability
Eneko Agirre | Carmen Banea | Claire Cardie | Daniel Cer | Mona Diab | Aitor Gonzalez-Agirre | Weiwei Guo | Iñigo Lopez-Gazpio | Montse Maritxalar | Rada Mihalcea | German Rigau | Larraitz Uria | Janyce Wiebe
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib
Unsupervised False Friend Disambiguation Using Contextual Word Clusters and Parallel Word Alignments
Maryam Aminian | Mahmoud Ghoneim | Mona Diab
Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf bib
Committed Belief Tagging on the Factbank and LU Corpora: A Comparative Study
Gregory Werner | Vinodkumar Prabhakaran | Mona Diab | Owen Rambow
Proceedings of the Second Workshop on Extra-Propositional Aspects of Meaning in Computational Semantics (ExProM 2015)

pdf bib
Named Entity Recognition for Arabic Social Media
Ayah Zirikly | Mona Diab
Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing

pdf bib
A Pilot Study on Arabic Multi-Genre Corpus Diacritization
Houda Bouamor | Wajdi Zaghouani | Mona Diab | Ossama Obeid | Kemal Oflazer | Mahmoud Ghoneim | Abdelati Hawwari
Proceedings of the Second Workshop on Arabic Natural Language Processing

pdf bib
GWU-HASP-2015@QALB-2015 Shared Task: Priming Spelling Candidates with Probability
Mohammed Attia | Mohamed Al-Badrashiny | Mona Diab
Proceedings of the Second Workshop on Arabic Natural Language Processing

pdf bib
Robust Part-of-speech Tagging of Arabic Text
Hanan Aldarmaki | Mona Diab
Proceedings of the Second Workshop on Arabic Natural Language Processing

2014

pdf bib
A Framework for the Classification and Annotation of Multiword Expressions in Dialectal Arabic
Abdelati Hawwari | Mohammed Attia | Mona Diab
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

pdf bib
Named Entity Recognition System for Dialectal Arabic
Ayah Zirikly | Mona Diab
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

pdf bib
GWU-HASP: Hybrid Arabic Spelling and Punctuation Corrector
Mohammed Attia | Mohamed Al-Badrashiny | Mona Diab
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

pdf bib
Proceedings of the First Workshop on Computational Approaches to Code Switching
Mona Diab | Julia Hirschberg | Pascale Fung | Thamar Solorio
Proceedings of the First Workshop on Computational Approaches to Code Switching

pdf bib
Overview for the First Shared Task on Language Identification in Code-Switched Data
Thamar Solorio | Elizabeth Blair | Suraj Maharjan | Steven Bethard | Mona Diab | Mahmoud Ghoneim | Abdelati Hawwari | Fahad AlGhamdi | Julia Hirschberg | Alison Chang | Pascale Fung
Proceedings of the First Workshop on Computational Approaches to Code Switching

pdf bib
AIDA: Identifying Code Switching in Informal Arabic Text
Heba Elfardy | Mohamed Al-Badrashiny | Mona Diab
Proceedings of the First Workshop on Computational Approaches to Code Switching

pdf bib
Handling OOV Words in Dialectal Arabic to English Machine Translation
Maryam Aminian | Mahmoud Ghoneim | Mona Diab
Proceedings of the EMNLP’2014 Workshop on Language Technology for Closely Related Languages and Language Variants

pdf bib
SemEval-2014 Task 10: Multilingual Semantic Textual Similarity
Eneko Agirre | Carmen Banea | Claire Cardie | Daniel Cer | Mona Diab | Aitor Gonzalez-Agirre | Weiwei Guo | Rada Mihalcea | German Rigau | Janyce Wiebe
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

bib
Natural Language Processing of Arabic and its Dialects
Mona Diab | Nizar Habash
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

This tutorial introduces the different challenges and current solutions to the automatic processing of Arabic and its dialects. The tutorial has two parts: First, we present a discussion of generic issues relevant to Arabic NLP and detail dialectal linguistic issues and the challenges they pose for NLP. In the second part, we review the state-of-the-art in Arabic processing covering several enabling technologies and applications, e.g., dialect identification, morphological processing (analysis, disambiguation, tokenization, POS tagging), parsing, and machine translation.

pdf bib
Fast Tweet Retrieval with Compact Binary Codes
Weiwei Guo | Wei Liu | Mona Diab
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib
Sentence Level Dialect Identification for Machine Translation System Selection
Wael Salloum | Heba Elfardy | Linda Alamir-Salloum | Nizar Habash | Mona Diab
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Tharwa: A Large Scale Dialectal Arabic - Standard Arabic - English Lexicon
Mona Diab | Mohamed Al-Badrashiny | Maryam Aminian | Mohammed Attia | Heba Elfardy | Nizar Habash | Abdelati Hawwari | Wael Salloum | Pradeep Dasigi | Ramy Eskander
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We introduce an electronic three-way lexicon, Tharwa, comprising Dialectal Arabic, Modern Standard Arabic and English correspondents. The paper focuses on Egyptian Arabic as the first pilot dialect for the resource, with plans to expand to other dialects of Arabic in later phases of the project. We describe Tharwa’s creation process and report on its current status. The lexical entries are augmented with various elements of linguistic information such as POS, gender, rationality, number, and root and pattern information. The lexicon is based on a compilation of information from both monolingual and bilingual existing resources such as paper dictionaries and electronic, corpus-based dictionaries. Multiple levels of quality checks are performed on the output of each step in the creation process. The importance of this lexicon lies in the fact that it is the first resource of its kind bridging multiple variants of Arabic with English. Furthermore, it is a wide coverage lexical resource containing over 73,000 Egyptian entries. Tharwa is publicly available. We believe it will have a significant impact on both Theoretical Linguistics as well as Computational Linguistics research.

pdf bib
MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic
Arfath Pasha | Mohamed Al-Badrashiny | Mona Diab | Ahmed El Kholy | Ramy Eskander | Nizar Habash | Manoj Pooleery | Owen Rambow | Ryan Roth
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we present MADAMIRA, a system for morphological analysis and disambiguation of Arabic that combines some of the best aspects of two previously commonly used systems for Arabic processing, MADA (Habash and Rambow, 2005; Habash et al., 2009; Habash et al., 2013) and AMIRA (Diab et al., 2007). MADAMIRA improves upon the two systems with a more streamlined Java implementation that is more robust, portable, extensible, and is faster than its ancestors by more than an order of magnitude. We also discuss an online demo (see http://nlp.ldeo.columbia.edu/madamira/) that highlights these aspects.

pdf bib
SANA: A Large Scale Multi-Genre, Multi-Dialect Lexicon for Arabic Subjectivity and Sentiment Analysis
Muhammad Abdul-Mageed | Mona Diab
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The computational treatment of subjectivity and sentiment in natural language is usually significantly improved by applying features exploiting lexical resources where entries are tagged with semantic orientation (e.g., positive, negative values). In spite of the fair amount of work on Arabic sentiment analysis over the past few years (e.g., (Abbasi et al., 2008; Abdul-Mageed et al., 2014; Abdul-Mageed et al., 2012; Abdul-Mageed and Diab, 2012a; Abdul-Mageed and Diab, 2012b; Abdul-Mageed et al., 2011a; Abdul-Mageed and Diab, 2011)), the language remains under-resourced as to these polarity repositories compared to the English language. In this paper, we report efforts to build and present SANA, a large-scale, multi-genre, multi-dialect multi-lingual lexicon for the subjectivity and sentiment analysis of the Arabic language and dialects.

2013

pdf bib
Multiword Expressions in the Context of Statistical Machine Translation
Mahmoud Ghoneim | Mona Diab
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
DIRA: Dialectal Arabic Information Retrieval Assistant
Arfath Pasha | Mohammad Al-Badrashiny | Mohamed Altantawy | Nizar Habash | Manoj Pooleery | Owen Rambow | Ryan M. Roth | Mona Diab
The Companion Volume of the Proceedings of IJCNLP 2013: System Demonstrations

pdf bib
Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media
Weiwei Guo | Hao Li | Heng Ji | Mona Diab
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Sentence Level Dialect Identification in Arabic
Heba Elfardy | Mona Diab
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Reranking with Linguistic and Semantic Features for Arabic Optical Character Recognition
Nadi Tomeh | Nizar Habash | Ryan Roth | Noura Farra | Pradeep Dasigi | Mona Diab
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Identifying Opinion Subgroups in Arabic Online Discussions
Amjad Abu-Jbara | Ben King | Mona Diab | Dragomir Radev
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
ASMA: A System for Automatic Segmentation and Morpho-Syntactic Disambiguation of Modern Standard Arabic
Muhammad Abdul-Mageed | Mona Diab | Sandra Kübler
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

pdf bib
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity
Mona Diab | Tim Baldwin | Marco Baroni
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

pdf bib
*SEM 2013 shared task: Semantic Textual Similarity
Eneko Agirre | Daniel Cer | Mona Diab | Aitor Gonzalez-Agirre | Weiwei Guo
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

pdf bib
Improving Lexical Semantics for Sentential Semantics: Modeling Selectional Preference and Similar Words in a Latent Variable Model
Weiwei Guo | Mona Diab
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Semantic Textual Similarity: past present and future
Mona Diab
Proceedings of the Joint Symposium on Semantic Processing. Textual Inference and Structures in Corpora

2012

pdf bib
Subgroup Detection in Ideological Discussions
Amjad Abu-Jbara | Pradeep Dasigi | Mona Diab | Dragomir Radev
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Modeling Sentences in the Latent Space
Weiwei Guo | Mona Diab
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Genre Independent Subgroup Detection in Online Discussion Threads: A Study of Implicit Attitude using Textual Latent Semantics
Pradeep Dasigi | Weiwei Guo | Mona Diab
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Learning the Latent Semantics of a Concept from its Definition
Weiwei Guo | Mona Diab
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)
Eneko Agirre | Johan Bos | Mona Diab | Suresh Manandhar | Yuval Marton | Deniz Yuret
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
Eneko Agirre | Daniel Cer | Mona Diab | Aitor Gonzalez-Agirre
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
Weiwei: A Simple Unsupervised Latent Semantics based Approach for Sentence Similarity
Weiwei Guo | Mona Diab
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
Who’s (Really) the Boss? Perception of Situational Power in Written Interactions
Vinodkumar Prabhakaran | Owen Rambow | Mona Diab
Proceedings of COLING 2012

pdf bib
Token Level Identification of Linguistic Code Switching
Heba Elfardy | Mona Diab
Proceedings of COLING 2012: Posters

pdf bib
Conventional Orthography for Dialectal Arabic
Nizar Habash | Mona Diab | Owen Rambow
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Dialectal Arabic (DA) refers to the day-to-day vernaculars spoken in the Arab world. DA lives side-by-side with the official language, Modern Standard Arabic (MSA). DA differs from MSA on all levels of linguistic representation, from phonology and morphology to lexicon and syntax. Unlike MSA, DA has no standard orthography since there are no Arabic dialect academies, nor is there a large edited body of dialectal literature that follows the same spelling standard. In this paper, we present CODA, a conventional orthography for dialectal Arabic; it is designed primarily for the purpose of developing computational models of Arabic dialects. We explain the design principles of CODA and provide a detailed description of its guidelines as applied to Egyptian Arabic.

pdf bib
Simplified guidelines for the creation of Large Scale Dialectal Arabic Annotations
Heba Elfardy | Mona Diab
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The Arabic language is a collection of dialectal variants along with the standard form, Modern Standard Arabic (MSA). MSA is used in official Settings while the dialectal variants (DA) correspond to the native tongue of the Arabic speakers. Arabic speakers typically code switch between DA and MSA, which is reflected extensively in written online social media. Automatic processing such Arabic genre is very difficult for automated NLP tools since the linguistic difference between MSA and DA is quite profound. However, no annotated resources exist for marking the regions of such switches in the utterance. In this paper, we present a simplified Set of guidelines for detecting code switching in Arabic on the word/token level. We use these guidelines in annotating a corpus that is rich in DA with frequent code switching to MSA. We present both a quantitative and qualitative analysis of the annotations.

pdf bib
Annotations for Power Relations on Email Threads
Vinodkumar Prabhakaran | Huzaifa Neralwala | Owen Rambow | Mona Diab
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Social relations like power and influence are difficult concepts to define, but are easily recognizable when expressed. In this paper, we describe a multi-layer annotation scheme for social power relations that are recognizable from online written interactions. We introduce a typology of four types of power relations between dialog participants: hierarchical power, situational power, influence and control of communication. We also present a corpus of Enron emails comprising of 122 threaded conversations, manually annotated with instances of these power relations between participants. Our annotations also capture attempts at exercise of power or influence and whether those attempts were successful or not. In addition, we also capture utterance level annotations for overt display of power. We describe the annotation definitions using two example email threads from our corpus illustrating each type of power relation. We also present detailed instructions given to the annotators and provide various statistics on annotations in the corpus.

pdf bib
AWATIF: A Multi-Genre Corpus for Modern Standard Arabic Subjectivity and Sentiment Analysis
Muhammad Abdul-Mageed | Mona Diab
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present AWATIF, a multi-genre corpus of Modern Standard Arabic (MSA) labeled for subjectivity and sentiment analysis (SSA) at the sentence level. The corpus is labeled using both regular as well as crowd sourcing methods under three different conditions with two types of annotation guidelines. We describe the sub-corpora constituting the corpus and provide examples from the various SSA categories. In the process, we present our linguistically-motivated and genre-nuanced annotation guidelines and provide evidence showing their impact on the labeling task.

pdf bib
A Pilot PropBank Annotation for Quranic Arabic
Wajdi Zaghouani | Abdelati Hawwari | Mona Diab
Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature

pdf bib
Building an Arabic Multiword Expressions Repository
Abdelati Hawwari | Kfir Bar | Mona Diab
Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages

pdf bib
SAMAR: A System for Subjectivity and Sentiment Analysis of Arabic Social Media
Muhammad Abdul-Mageed | Sandra Kuebler | Mona Diab
Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis

pdf bib
Statistical Modality Tagging from Rule-based Annotations and Crowdsourcing
Vinodkumar Prabhakaran | Michael Bloodgood | Mona Diab | Bonnie Dorr | Lori Levin | Christine D. Piatko | Owen Rambow | Benjamin Van Durme
Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics

pdf bib
Predicting Overt Display of Power in Written Dialogs
Vinodkumar Prabhakaran | Owen Rambow | Mona Diab
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Arabic Dialect Processing Tutorial
Mona Diab | Nizar Habash
Tutorial Abstracts at the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2011

pdf bib
CODACT: Towards Identifying Orthographic Variants in Dialectal Arabic
Pradeep Dasigi | Mona Diab
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Semantic Topic Models: Combining Word Distributional Statistics and Dictionary Definitions
Weiwei Guo | Mona Diab
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

pdf bib
Subjectivity and Sentiment Analysis of Modern Standard Arabic
Muhammad Abdul-Mageed | Mona Diab | Mohammed Korayem
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Subjectivity and Sentiment Annotation of Modern Standard Arabic Newswire
Muhammad Abdul-Mageed | Mona Diab
Proceedings of the 5th Linguistic Annotation Workshop

pdf bib
Named Entity Transliteration Generation Leveraging Statistical Machine Translation Technology
Pradeep Dasigi | Mona Diab
Proceedings of the 3rd Named Entities Workshop (NEWS 2011)

pdf bib
Feasibility of Leveraging Crowd Sourcing for the Creation of a Large Scale Annotated Resource for Hindi English Code Switched Data: A Pilot Annotation
Mona Diab | Ankit Kamboj
Proceedings of the 9th Workshop on Asian Language Resources

2010

pdf bib
Combining Orthogonal Monolingual and Multilingual Sources of Evidence for All Words WSD
Weiwei Guo | Mona Diab
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Arabic Named Entity Recognition: Using Features Extracted from Noisy Data
Yassine Benajiba | Imed Zitouni | Mona Diab | Paolo Rosso
Proceedings of the ACL 2010 Conference Short Papers

pdf bib
Task-based Evaluation of Multiword Expressions: a Pilot Study in Statistical Machine Translation
Marine Carpuat | Mona Diab
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
COLEPL and COLSLM: An Unsupervised WSD Approach to Multilingual Lexical Substitution, Tasks 2 and 3 SemEval 2010
Weiwei Guo | Mona Diab
Proceedings of the 5th International Workshop on Semantic Evaluation

pdf bib
Automatic Committed Belief Tagging
Vinodkumar Prabhakaran | Owen Rambow | Mona Diab
Coling 2010: Posters

pdf bib
The Revised Arabic PropBank
Wajdi Zaghouani | Mona Diab | Aous Mansouri | Sameer Pradhan | Martha Palmer
Proceedings of the Fourth Linguistic Annotation Workshop

2009

pdf bib
Handling Sparsity for Verb Noun MWE Token Classification
Mona Diab | Madhav Krishna
Proceedings of the Workshop on Geometrical Models of Natural Language Semantics

pdf bib
Improvements To Monolingual English Word Sense Disambiguation
Weiwei Guo | Mona Diab
Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009)

pdf bib
Verb Noun Construction MWE Token Classification
Mona Diab | Pravin Bhutada
Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE 2009)

pdf bib
Committed Belief Annotation and Tagging
Mona Diab | Lori Levin | Teruko Mitamura | Owen Rambow | Vinodkumar Prabhakaran | Weiwei Guo
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

pdf bib
Who, What, When, Where, Why? Comparing Multiple Approaches to the Cross-Lingual 5W Task
Kristen Parton | Kathleen R. McKeown | Bob Coyne | Mona T. Diab | Ralph Grishman | Dilek Hakkani-Tür | Mary Harper | Heng Ji | Wei Yun Ma | Adam Meyers | Sara Stolbach | Ang Sun | Gokhan Tur | Wei Xu | Sibel Yaman
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

pdf bib
Semantic Role Labeling Systems for Arabic using Kernel Methods
Mona Diab | Alessandro Moschitti | Daniele Pighin
Proceedings of ACL-08: HLT

pdf bib
Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking
Ryan Roth | Owen Rambow | Nizar Habash | Mona Diab | Cynthia Rudin
Proceedings of ACL-08: HLT, Short Papers

pdf bib
A Pilot Arabic Propbank
Martha Palmer | Olga Babko-Malaya | Ann Bies | Mona Diab | Mohamed Maamouri | Aous Mansouri | Wajdi Zaghouani
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper, we present the details of creating a pilot Arabic proposition bank (Propbank). Propbanks exist for both English and Chinese. However the morphological and syntactic expression of linguistic phenomena in Arabic yields a very different type of process in creating an Arabic propbank. Hence, we highlight those characteristics of Arabic that make creating a propbank for the language a different challenge compared to the creation of an English Propbank.We believe that many of the lessons learned in dealing with Arabic could generalise to other languages that exhibit equally rich morphology and relatively free word order.

pdf bib
Coling 2008: Proceedings of the 3rd Textgraphs workshop on Graph-based Algorithms for Natural Language Processing
Irina Matveeva | Chris Biemann | Monojit Choudhury | Mona Diab
Coling 2008: Proceedings of the 3rd Textgraphs workshop on Graph-based Algorithms for Natural Language Processing

pdf bib
Arabic Named Entity Recognition using Optimized Feature Sets
Yassine Benajiba | Mona Diab | Paolo Rosso
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

pdf bib
Arabic Dialect Processing Tutorial
Mona Diab | Nizar Habash
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Tutorial Abstracts

pdf bib
Improved Arabic Base Phrase Chunking with a new enriched POS tag set
Mona Diab
Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources

pdf bib
Arabic diacritization in the context of statistical machine translation
Mona Diab | Mahmoud Ghoneim | Nizar Habash
Proceedings of Machine Translation Summit XI: Papers

pdf bib
Semi-automatic error analysis for large-scale statistical machine translation
Katrin Kirchhoff | Owen Rambow | Nizar Habash | Mona Diab
Proceedings of Machine Translation Summit XI: Papers

pdf bib
SemEval-2007 Task 18: Arabic Semantic Labeling
Mona Diab | Musa Alkhalifa | Sabry ElKateb | Christiane Fellbaum | Aous Mansouri | Martha Palmer
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

pdf bib
CUNIT: A Semantic Role Labeling System for Modern Standard Arabic
Mona Diab | Alessandro Moschitti | Daniele Pighin
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

pdf bib
Unsupervised Induction of Modern Standard Arabic Verb Classes
Neal Snider | Mona Diab
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers

pdf bib
Arabic Dialect Processing
Mona Diab | Nizar Habash
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Tutorials

pdf bib
Developing and Using a Pilot Dialectal Arabic Treebank
Mohamed Maamouri | Ann Bies | Tim Buckwalter | Mona Diab | Nizar Habash | Owen Rambow | Dalila Tabessi
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we describe the methodological procedures and issues that emerged from the development of a pilot Levantine Arabic Treebank (LATB) at the Linguistic Data Consortium (LDC) and its use at the Johns Hopkins University (JHU) Center for Language and Speech Processing workshop on Parsing Arabic Dialects (PAD). This pilot, consisting of morphological and syntactic annotation of approximately 26,000 words of Levantine Arabic conversational telephone speech, was developed under severe time constraints; hence the LDC team drew on their experience in treebanking Modern Standard Arabic (MSA) text. The resulting Levantine dialect treebanked corpus was used by the PAD team to develop and evaluate parsers for Levantine dialect texts. The parsers were trained on MSA resources and adapted using dialect-MSA lexical resources (some developed especially for this task) and existing linguistic knowledge about syntactic differences between MSA and dialect. The use of the LATB for development and evaluation of syntactic parsers allowed the PAD team to provide feedbasck to the LDC treebank developers. In this paper, we describe the creation of resources for this corpus, as well as transformations on the corpus to eliminate speech effects and lessen the gap between our pre-existing MSA resources and the new dialectal corpus

pdf bib
Parsing Arabic Dialects
David Chiang | Mona Diab | Nizar Habash | Owen Rambow | Safiullah Shareef
11th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Unsupervised Induction of Modern Standard Arabic Verb Classes Using Syntactic Frames and LSA
Neal Snider | Mona Diab
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

2005

pdf bib
Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages
Kareem Darwish | Mona Diab | Nizar Habash
Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages

2004

pdf bib
Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks
Mona Diab | Kadri Hacioglu | Daniel Jurafsky
Proceedings of HLT-NAACL 2004: Short Papers

pdf bib
An Unsupervised Approach for Bootstrapping Arabic Sense Tagging
Mona T. Diab
Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages

pdf bib
Relieving the data Acquisition Bottleneck in Word Sense Disambiguation
Mona Diab
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

2002

pdf bib
An Unsupervised Method for Word Sense Tagging using Parallel Corpora
Mona Diab | Philip Resnik
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics

2000

pdf bib
An Unsupervised Method for Multilingual Word Sense Tagging Using Parallel Corpora
Mona Diab
ACL-2000 Workshop on Word Senses and Multi-linguality

Search
Co-authors