Mona Diab

Also published as: Mona T. Diab

2025

pdf bib abs
BIG5-CHAT: Shaping LLM Personalities Through Training on Human-Grounded Data
Wenkai Li | Jiarui Liu | Andy Liu | Xuhui Zhou | Mona T. Diab | Maarten Sap
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this work, we tackle the challenge of embedding realistic human personality traits into LLMs. Previous approaches have primarily focused on prompt-based methods that describe the behavior associated with the desired personality traits, suffering from realism and validity issues. To address these limitations, we introduce BIG5-CHAT, a large-scale dataset containing 100,000 dialogues designed to ground models in how humans express their personality in text. Leveraging this dataset, we explore Supervised Fine-Tuning and Direct Preference Optimization as training-based methods to align LLMs more naturally with human personality patterns. Our methods outperform prompting on personality assessments such as BFI and IPIP-NEO, with trait correlations more closely matching human data. Furthermore, our experiments reveal that models trained to exhibit higher conscientiousness, higher agreeableness, lower extraversion, and lower neuroticism display better performance on reasoning tasks, aligning with psychological findings on how these traits impact human cognitive performance. To our knowledge, this work is the first comprehensive study to demonstrate how training-based methods can shape LLM personalities through learning from real human behaviors.

pdf bib abs
Humanizing Machines: Rethinking LLM Anthropomorphism Through a Multi-Level Framework of Design
Yunze Xiao | Lynnette Hui Xian Ng | Jiarui Liu | Mona T. Diab
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) increasingly exhibit anthropomorphism characteristics – human-like qualities portrayed across their outlook, language, behavior, and reasoning functions. Such characteristics enable more intuitive and engaging human-AI interactions. However, current research on anthropomorphism remains predominantly risk-focused, emphasizing over-trust and user deception while offering limited design guidance. We argue that anthropomorphism should instead be treated as a concept of design that can be intentionally tuned to support user goals. Drawing from multiple disciplines, we propose that the anthropomorphism of an LLM-based artifact should reflect the interaction between artifact designers and interpreters. This interaction is facilitated by cues embedded in the artifact by the designers and the (cognitive) responses of the interpreters to the cues. Cues are categorized into four dimensions: perceptive, linguistic, behavioral, and cognitive. By analyzing the manifestation and effectiveness of each cue, we provide a unified taxonomy with actionable levers for practitioners. Consequently, we advocate for function-oriented evaluations of anthropomorphic design.

As large language models (LLMs) are increasingly used in morally sensitive domains, it is crucial to understand how persona traits affect their moral reasoning and persuasive behavior. We present the first large-scale study of multi-dimensional persona effects in AI-AI debates over real-world moral dilemmas. Using a 6-dimensional persona space (age, gender, country, social class, ideology, and personality), we simulate structured debates between AI agents over 131 relationship-based cases. Our results show that personas affect initial moral stances and debate outcomes, with political ideology and personality traits exerting the strongest influence. Persuasive success varies across traits, with liberal and open personalities reaching higher consensus. While logit-based confidence grows during debates, emotional and credibility-based appeals diminish, indicating more tempered argumentation over time. These trends mirror findings from psychology and cultural studies, reinforcing the need for persona-aware evaluation frameworks for AI moral reasoning.

pdf bib abs
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models
Aashiq Muhamed | Mona T. Diab | Virginia Smith
Findings of the Association for Computational Linguistics: NAACL 2025

Understanding and mitigating the potential risks associated with foundation models (FMs) hinges on developing effective interpretability methods. Sparse Autoencoders (SAEs) have emerged as a promising tool for disentangling FM representations, but they struggle to capture rare, yet crucial concepts in the data. We introduce Specialized Sparse Autoencoders (SSAEs), designed to illuminate these elusive dark matter features by focusing on specific subdomains. We present a practical recipe for training SSAEs, demonstrating the efficacy of dense retrieval for data selection and the benefits of Tilted Empirical Risk Minimization as a training objective to improve concept recall. Our evaluation of SSAEs on standard metrics, such as downstream perplexity and L₀ sparsity, show that they effectively capture subdomain tail concepts, exceeding the capabilities of general-purpose SAEs. We showcase the practical utility of SSAEs in a case study on the Bias in Bios dataset, where SSAEs achieve a 12.5% increase in worst-group classification accuracy over the pretrained general-purpose SAE when applied to remove spurious gender information. SSAEs provide a powerful new lens for peering into the inner workings of FMs in subdomains.

pdf bib abs
Biases Propagate in Encoder-based Vision-Language Models: A Systematic Analysis From Intrinsic Measures to Zero-shot Retrieval Outcomes
Kshitish Ghate | Tessa Charlesworth | Mona T. Diab | Aylin Caliskan
Findings of the Association for Computational Linguistics: ACL 2025

To build fair AI systems we need to understand how social-group biases intrinsic to foundational encoder-based vision-language models (VLMs) manifest in biases in downstream tasks. In this study, we demonstrate that intrinsic biases in VLM representations systematically “carry over” or propagate into zero-shot retrieval tasks, revealing how deeply rooted biases shape a model’s outputs. We introduce a controlled framework to measure this propagation by correlating (a) intrinsic measures of bias in the representational space with (b) extrinsic measures of bias in zero-shot text-to-image (TTI) and image-to-text (ITT) retrieval. Results show substantial correlations between intrinsic and extrinsic bias, with an average 𝜌 = 0.83 ± 0.10. This pattern is consistent across 114 analyses, both retrieval directions, six social groups, and three distinct VLMs. Notably, we find that larger/better-performing models exhibit greater bias propagation, a finding that raises concerns given the trend towards increasingly complex AI models. Our framework introduces baseline evaluation tasks to measure the propagation of group and valence signals. Investigations reveal that underrepresented groups experience less robust propagation, further skewing their model-related outcomes.

The field of machine translation has achieved significant advancements, yet domain-specific terminology translation, particularly in AI, remains challenging. This work introduces GIST, a large-scale multilingual AI terminology dataset containing 5K terms extracted from top AI conference papers spanning 2000 to 2023. The terms were translated into Arabic, Chinese, French, Japanese, and Russian using a hybrid framework that combines LLMs for extraction with human expertise for translation. The dataset’s quality was benchmarked against existing resources, demonstrating superior translation accuracy through crowdsourced evaluation. GIST was integrated into translation workflows using post-translation refinement methods that required no retraining, where LLM prompting consistently improved BLEU and COMET scores. A web demonstration on the ACL Anthology platform highlights its practical application, showcasing improved accessibility for non-English speakers. We address a critical gap in AI terminology resources and fosters global inclusivity and collaboration in AI research.

pdf bib abs
From Complexity to Clarity: AI/NLP’s Role in Regulatory Compliance
Jivitesh Jain | Nivedhitha Dhanasekaran | Mona T. Diab
Findings of the Association for Computational Linguistics: ACL 2025

Regulatory data compliance is a cornerstone of trust and accountability in critical sectors like finance, healthcare, and technology, yet its complexity poses significant challenges for organizations worldwide. Recent advances in natural language processing, particularly large language models, have demonstrated remarkable capabilities in text analysis and reasoning, offering promising solutions for automating compliance processes. This survey examines the current state of automated data compliance, analyzing key challenges and approaches across problem areas. We identify critical limitations in current datasets and techniques, including issues of adaptability, completeness, and trust. Looking ahead, we propose research directions to address these challenges, emphasizing standardized evaluation frameworks and balanced human-AI collaboration.

pdf bib abs
SimBA: Simplifying Benchmark Analysis Using Performance Matrices Alone
Nishant Subramani | Alfredo Gomez | Mona T. Diab
Findings of the Association for Computational Linguistics: EMNLP 2025

Modern language models are evaluated on large benchmarks, which are difficult to make sense of, especially for model selection.Looking at the raw evaluation numbers themselves using a model-centric lens, we propose SimBA, a three phase framework to Simplify Benchmark Analysis. The three phases of SimBA are: stalk, where we conduct dataset & model comparisons, prowl, where we discover a representative subset, and pounce, where we use the representative subset to predict performance on a held-out set of models. Applying SimBA to three popular LM benchmarks: HELM, MMLU, and BigBenchLite reveals that across all three benchmarks, datasets and models relate strongly to one another (stalk). We develop an representative set discovery algorithm which covers a benchmark using raw evaluation scores alone. Using our algorithm, we find that with 6.25% (1/16), 1.7% (1/58), and 28.4% (21/74) of the datasets for HELM, MMLU, and BigBenchLite respectively, we achieve coverage levels of at least 95% (prowl). Additionally, using just these representative subsets, we can both preserve model ranks and predict performance on a held-out set of models with near zero mean-squared error (pounce). Taken together, SimBA can help model developers improve efficiency during model training and dataset creators validate whether their newly created dataset differs from existing datasets in a benchmark. Our code is open source, available at https://github.com/nishantsubramani/simba.

pdf bib abs
Intrinsic Bias is Predicted by Pretraining Data and Correlates with Downstream Performance in Vision-Language Encoders
Kshitish Ghate | Isaac Slaughter | Kyra Wilson | Mona T. Diab | Aylin Caliskan
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

While recent work has found that vision-language models trained under the Contrastive Language Image Pre-training (CLIP) framework contain intrinsic social biases, the extent to which different upstream pre-training features of the framework relate to these biases, and hence how intrinsic bias and downstream performance are connected has been unclear. In this work, we present the largest comprehensive analysis to-date of how the upstream pre-training factors and downstream performance of CLIP models relate to their intrinsic biases. Studying 131 unique CLIP models, trained on 26 datasets, using 55 architectures, and in a variety of sizes, we evaluate bias in each model using 26 well-established unimodal and cross-modal principled Embedding Association Tests. We find that the choice of pre-training dataset is the most significant upstream predictor of bias, whereas architectural variations have minimal impact. Additionally, datasets curated using sophisticated filtering techniques aimed at enhancing downstream model performance tend to be associated with higher levels of intrinsic bias. Finally, we observe that intrinsic bias is often significantly correlated with downstream performance (0.3 ≤ r ≤ 0.8), suggesting that models optimized for performance inadvertently learn to amplify representational biases. Comparisons between unimodal and cross-modal association tests reveal that social group bias depends heavily on the modality. Our findings imply that more sophisticated strategies are needed to address intrinsic model bias for vision-language models across the entire model development pipeline.

pdf bib abs
CoRAG: Collaborative Retrieval-Augmented Generation
Aashiq Muhamed | Mona T. Diab | Virginia Smith
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Retrieval-Augmented Generation (RAG) models excel in knowledge-intensive tasks, especially under few-shot learning constraints. We introduce CoRAG, a framework extending RAG to collaborative settings, where clients jointly train a shared model using a collaborative passage store. To evaluate CoRAG, we introduce CRAB, a benchmark for collaborative homogeneous open-domain question answering. Our experiments demonstrate that CoRAG consistently outperforms both parametric collaborative learning methods and locally trained RAG models in low-resource scenarios. Further analysis reveals the critical importance of relevant passages within the shared store, the surprising benefits of incorporating irrelevant passages, and the potential for hard negatives to negatively impact performance. This introduces a novel consideration in collaborative RAG: the trade-off between leveraging a collectively enriched knowledge base and the potential risk of incorporating detrimental passages from other clients. Our findings underscore the viability of CoRAG, while also highlighting key design challenges and promising avenues for future research.

2024

pdf bib abs
Investigating Cultural Alignment of Large Language Models
Badr AlKhamissi | Muhammad ElNokrashy | Mai Alkhamissi | Mona Diab
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The intricate relationship between language and culture has long been a subject of exploration within the realm of linguistic anthropology. Large Language Models (LLMs), promoted as repositories of collective human knowledge, raise a pivotal question: do these models genuinely encapsulate the diverse knowledge adopted by different cultures? Our study reveals that these models demonstrate greater cultural alignment along two dimensions—firstly, when prompted with the dominant language of a specific culture, and secondly, when pretrained with a refined mixture of languages employed by that culture. We quantify cultural alignment by simulating sociological surveys, comparing model responses to those of actual survey participants as references. Specifically, we replicate a survey conducted in various regions of Egypt and the United States through prompting LLMs with different pretraining data mixtures in both Arabic and English with the personas of the real respondents and the survey questions. Further analysis reveals that misalignment becomes more pronounced for underrepresented personas and for culturally sensitive topics, such as those probing social values. Finally, we introduce Anthropological Prompting, a novel method leveraging anthropological reasoning to enhance cultural alignment. Our study emphasizes the necessity for a more balanced multilingual pretraining dataset to better represent the diversity of human experience and the plurality of different cultures with many implications on the topic of cross-lingual transfer.

We present an overview of the FIGNEWSshared task, organized as part of the Arabic-NLP 2024 conference co-located with ACL2024. The shared task addresses bias and pro-paganda annotation in multilingual news posts.We focus on the early days of the Israel War onGaza as a case study. The task aims to fostercollaboration in developing annotation guide-lines for subjective tasks by creating frame-works for analyzing diverse narratives high-lighting potential bias and propaganda. In aspirit of fostering and encouraging diversity,we address the problem from a multilingualperspective, namely within five languages: En-glish, French, Arabic, Hebrew, and Hindi. Atotal of 17 teams participated in two annota-tion subtasks: bias (16 teams) and propaganda(6 teams). The teams competed in four evalua-tion tracks: guidelines development, annotationquality, annotation quantity, and consistency.Collectively, the teams produced 129,800 datapoints. Key findings and implications for thefield are discussed.

pdf bib
GRASS: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients
Aashiq Muhamed | Oscar Li | David Woodruff | Mona T. Diab | Virginia Smith
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

pdf bib abs
Evaluating Large Language Model Biases in Persona-Steered Generation
Andy Liu | Mona Diab | Daniel Fried
Findings of the Association for Computational Linguistics: ACL 2024

The task of persona-steered text generation requires large language models (LLMs) to generate text that reflects the distribution of views that an individual fitting a persona could have. People have multifaceted personas, but prior work on bias in LLM-generated opinions has only explored multiple-choice settings or one-dimensional personas. We define an incongruous persona as a persona with multiple traits where one trait makes its other traits less likely in human survey data, e.g. political liberals who support increased military spending. We find that LLMs are 9.7% less steerable towards incongruous personas than congruous ones, sometimes generating the stereotypical stance associated with its demographic rather than the target stance. Models that we evaluate that are fine-tuned with Reinforcement Learning from Human Feedback (RLHF) are more steerable, especially towards stances associated with political liberals and women, but present significantly less diverse views of personas. We also find variance in LLM steerability that cannot be predicted from multiple-choice opinion evaluation. Our results show the importance of evaluating models in open-ended text generation, as it can surface new LLM opinion biases. Moreover, such a setup can shed light on our ability to steer models toward a richer and more diverse range of viewpoints.

pdf bib abs
Semantic Compression for Word and Sentence Embeddings using Discrete Wavelet Transform
Rana Salama | Abdou Youssef | Mona Diab
Findings of the Association for Computational Linguistics: ACL 2024

Wavelet transforms, a powerful mathematical tool, have been widely used in different domains, including Signal and Image processing, to unravel intricate patterns, enhance data representation, and extract meaningful features from data. Tangible results from their application suggest that Wavelet transforms can be applied to NLP capturing a variety of linguistic and semantic properties.In this paper, we empirically leverage the application of Discrete Wavelet Transforms (DWT) to word and sentence embeddings. We aim to showcase the capabilities of DWT in analyzing embedding representations at different levels of resolution and compressing them while maintaining their overall quality.We assess the effectiveness of DWT embeddings on semantic similarity tasks to show how DWT can be used to consolidate important semantic information in an embedding vector. We show the efficacy of the proposed paradigm using different embedding models, including large language models, on downstream tasks. Our results show that DWT can reduce the dimensionality of embeddings by 50-93% with almost no change in performance for semantic similarity tasks, while achieving superior accuracy in most downstream tasks. Our findings pave the way for applying DWT to improve NLP applications.

pdf bib abs
Depth-Wise Attention (DWAtt): A Layer Fusion Method for Data-Efficient Classification
Muhammad ElNokrashy | Badr AlKhamissi | Mona Diab
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Language Models pretrained on large textual data have been shown to encode different types of knowledge simultaneously. Traditionally, only the features from the last layer are used when adapting to new tasks or data. We put forward that, when using or finetuning deep pretrained models, intermediate layer features that may be relevant to the downstream task are buried too deep to be used efficiently in terms of needed samples or steps. To test this, we propose a new layer fusion method: Depth-Wise Attention (DWAtt), to help re-surface signals from non-final layers. We compare DWAtt to a basic concatenation-based layer fusion method (Concat), and compare both to a deeper model baseline—all kept within a similar parameter budget. Our findings show that DWAtt and Concat are more step- and sample-efficient than the baseline, especially in the few-shot setting. DWAtt outperforms Concat on larger data sizes. On CoNLL-03 NER, layer fusion shows 3.68 − 9.73% F1 gain at different few-shot sizes. The layer fusion models presented significantly outperform the baseline in various training scenarios with different data sizes, architectures, and training constraints.

pdf bib abs
Automatic Generation of Model and Data Cards: A Step Towards Responsible AI
Jiarui Liu | Wenkai Li | Zhijing Jin | Mona Diab
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

In an era of model and data proliferation in machine learning/AI especially marked by the rapid advancement of open-sourced technologies, there arises a critical need for standardized consistent documentation. Our work addresses the information incompleteness in current human-written model and data cards. We propose an automated generation approach using Large Language Models (LLMs). Our key contributions include the establishment of CardBench, a comprehensive dataset aggregated from over 4.8k model cards and 1.4k data cards, coupled with the development of the CardGen pipeline comprising a two-step retrieval process. Our approach exhibits enhanced completeness, objectivity, and faithfulness in generated model and data cards, a significant step in responsible AI documentation practices ensuring better accountability and traceability.

pdf bib abs
Analyzing the Role of Semantic Representations in the Era of Large Language Models
Zhijing Jin | Yuen Chen | Fernando Gonzalez Adauto | Jiarui Liu | Jiayi Zhang | Julian Michael | Bernhard Schölkopf | Mona Diab
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Traditionally, natural language processing (NLP) models often use a rich set of features created by linguistic expertise, such as semantic representations. However, in the era of large language models (LLMs), more and more tasks are turned into generic, end-to-end sequence generation problems. In this paper, we investigate the question: what is the role of semantic representations in the era of LLMs? Specifically, we investigate the effect of Abstract Meaning Representation (AMR) across five diverse NLP tasks. We propose an AMR-driven chain-of-thought prompting method, which we call AMRCOT, and find that it generally hurts performance more than it helps. To investigate what AMR may have to offer on these tasks, we conduct a series of analysis experiments. We find that it is difficult to predict which input examples AMR may help or hurt on, but errors tend to arise with multi-word expressions, named entities, and in the final inference step where the LLM must connect its reasoning over the AMR to its prediction. We recommend focusing on these areas for future work in semantic representations for LLMs. Our code: https://github.com/causalNLP/amr_llm

2023

Recent advancements in large language models have enabled them to perform well on complex tasks that require step-by-step reasoning with few-shot learning. However, it is unclear whether these models are applying reasoning skills they have learnt during pre-training , or if they are simply memorizing their training corpus at finer granularity and have learnt to better understand their context. To address this question, we introduce {pasted macro ‘OUR’}model, a benchmark and suite of analyses for evaluating reasoning skills of language models. {pasted macro ‘OUR’}model enables comparing pre-trained and finetuned models on complex tasks that require reasoning skills to solve. Our benchmark provides a test bed to asses any language model on fine-grained reasoning skills, which spans over 20 datasets and covers 10 different reasoning skills. By using {pasted macro ‘OUR’}model we further investigate the role of finetuning. Our extensive empirical analysis shows that language models learn more reasoning skills such as textual entailment, abductive reasoning, and analogical reasoning during the finetuning stage compared to pretraining stage. However, we also find that when language models are finetuned they tend to overfit to the prompt template, which hurts the robustness of models causing generalization problems.

pdf bib abs
Care4Lang at MEDIQA-Chat 2023: Fine-tuning Language Models for Classifying and Summarizing Clinical Dialogues
Amal Alqahtani | Rana Salama | Mona Diab | Abdou Youssef
Proceedings of the 5th Clinical Natural Language Processing Workshop

Summarizing medical conversations is one of the tasks proposed by MEDIQA-Chat to promote research on automatic clinical note generation from doctor-patient conversations. In this paper, we present our submission to this task using fine-tuned language models, including T5, BART and BioGPT models. The fine-tuned models are evaluated using ensemble metrics including ROUGE, BERTScore andBLEURT. Among the fine-tuned models, Flan-T5 achieved the highest aggregated score for dialogue summarization.

Language models can memorize a considerable amount of factual information during pretraining that can be elicited through prompting or finetuning models on tasks like question answering. In this paper, we discuss approaches to measuring model factual beliefs, updating incorrect factual beliefs in models, and visualizing graphical relationships between factual beliefs. Our main contributions include: (1) new metrics for evaluating belief-updating methods focusing on the logical consistency of beliefs, (2) a training objective for Sequential, Local, and Generalizing updates (SLAG) that improves the performance of existing hypernetwork approaches, and (3) the introduction of the belief graph, a new form of visualization for language models that shows relationships between stored model beliefs. Our experiments suggest that models show only limited consistency between factual beliefs, but update methods can both fix incorrect model beliefs and greatly improve their consistency. Although off-the-shelf optimizers are surprisingly strong belief-updating baselines, our learned optimizers can outperform them in more difficult settings than have been considered in past work.

pdf bib abs
Evaluating Multilingual Speech Translation under Realistic Conditions with Resegmentation and Terminology
Elizabeth Salesky | Kareem Darwish | Mohamed Al-Badrashiny | Mona Diab | Jan Niehues
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

We present the ACL 60/60 evaluation sets for multilingual translation of ACL 2022 technical presentations into 10 target languages. This dataset enables further research into multilingual speech translation under realistic recording conditions with unsegmented audio and domain-specific terminology, applying NLP tools to text and speech in the technical domain, and evaluating and improving model robustness to diverse speaker demographics.

pdf bib abs
OPT-R: Exploring the Role of Explanations in Finetuning and Prompting for Reasoning Skills of Large Language Models
Badr Alkhamissi | Siddharth Verma | Ping Yu | Zhijing Jin | Asli Celikyilmaz | Mona Diab
Proceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations (NLRSE)

We conduct a thorough investigation into the reasoning capabilities of Large Language Models (LLMs), focusing specifically on the Open Pretrained Transformers (OPT) models as a representative of such models. Our study entails finetuning three different sizes of OPT on a carefully curated reasoning corpus, resulting in two sets of finetuned models: OPT-R, finetuned without explanations, and OPT-RE, finetuned with explanations. We then evaluate all models on 57 out-of-domain tasks drawn from the Super-NaturalInstructions benchmark, covering 26 distinct reasoning skills, utilizing three prompting techniques. Through a comprehensive grid of 27 configurations and 6,156 test evaluations, we investigate the dimensions of finetuning, prompting, and scale to understand the role of explanations on different reasoning skills. Our findings reveal that having explanations in the fewshot exemplar has no significant impact on the model’s performance when the model is finetuned, while positively affecting the non-finetuned counterpart. Moreover, we observe a slight yet consistent increase in classification accuracy as we incorporate explanations during prompting and finetuning, respectively. Finally, we offer insights on which reasoning skills benefit the most from incorporating explanations during finetuning and prompting, such as Numerical (+20.4%) and Analogical (+13.9%) reasoning, as well as skills that exhibit negligible or negative effects.

pdf bib abs
Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values
Yejin Bang | Tiezheng Yu | Andrea Madotto | Zhaojiang Lin | Mona Diab | Pascale Fung
Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)

Many NLP classification tasks, such as sexism/racism detection or toxicity detection, are based on human values. Yet, human values can vary under diverse cultural conditions. Therefore, we introduce a framework for value-aligned classification that performs prediction based on explicitly written human values in the command. Along with the task, we propose a practical approach that distills value-aligned knowledge from large-scale language models (LLMs) to construct value-aligned classifiers in two steps. First, we generate value-aligned training data from LLMs by prompt-based few-shot learning. Next, we fine-tune smaller classification models with the generated data for the task. Empirical results show that our VA-Models surpass multiple baselines by at least 15.56% on the F1-score, including few-shot learning with OPT-175B and existing text augmentation methods. We suggest that using classifiers with explicit human value input improves both inclusivity & explainability in AI.

2022

pdf bib abs
Text Characterization Toolkit (TCT)
Daniel Simig | Tianlu Wang | Verna Dankers | Peter Henderson | Khuyagbaatar Batsuren | Dieuwke Hupkes | Mona Diab
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: System Demonstrations

We present a tool, Text Characterization Toolkit (TCT), that researchers can use to study characteristics of large datasets. Furthermore, such properties can lead to understanding the influence of such attributes on models’ behaviour. Traditionally, in most NLP research, models are usually evaluated by reporting single-number performance scores on a number of readily available benchmarks, without much deeper analysis. Here, we argue that – especially given the well-known fact that benchmarks often contain biases, artefacts, and spurious correlations – deeper results analysis should become the de-facto standard when presenting new models or benchmarks. TCT aims at filling this gap by facilitating such deeper analysis for datasets at scale, where datasets can be for training/development/evaluation. TCT includes both an easy-to-use tool, as well as off-the-shelf scripts that can be used for specific analyses. We also present use-cases from several different domains. TCT is used to predict difficult examples for given well-known trained models; TCT is also used to identify (potentially harmful) biases present in a dataset.

pdf bib abs
Consistent Human Evaluation of Machine Translation across Language Pairs
Daniel Licht | Cynthia Gao | Janice Lam | Francisco Guzman | Mona Diab | Philipp Koehn
Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Obtaining meaningful quality scores for machine translation systems through human evaluation remains a challenge given the high variability between human evaluators, partly due to subjective expectations for translation quality for different language pairs. We propose a new metric called XSTS that is more focused on semantic equivalence and a cross-lingual calibration method that enables more consistent assessment. We demonstrate the effectiveness of these novel contributions in large scale evaluation studies across up to 14 language pairs, with translation both into and out of English.

pdf bib abs
Knowledge-Augmented Language Models for Cause-Effect Relation Classification
Pedram Hosseini | David A. Broniatowski | Mona Diab
Proceedings of the First Workshop on Commonsense Representation and Reasoning (CSRR 2022)

Previous studies have shown the efficacy of knowledge augmentation methods in pretrained language models. However, these methods behave differently across domains and downstream tasks. In this work, we investigate the augmentation of pretrained language models with knowledge graph data in the cause-effect relation classification and commonsense causal reasoning tasks. After automatically verbalizing triples in ATOMIC2020, a wide coverage commonsense reasoning knowledge graph, we continually pretrain BERT and evaluate the resulting model on cause-effect pair classification and answering commonsense causal reasoning questions. Our results show that a continually pretrained language model augmented with commonsense reasoning knowledge outperforms our baselines on two commonsense causal reasoning benchmarks, COPA and BCOPA-CE, and a Temporal and Causal Reasoning (TCR) dataset, without additional improvement in model architecture or using quality-enhanced data for fine-tuning.

Hate speech detection is complex; it relies on commonsense reasoning, knowledge of stereotypes, and an understanding of social nuance that differs from one culture to the next. It is also difficult to collect a large-scale hate speech annotated dataset. In this work, we frame this problem as a few-shot learning task, and show significant gains with decomposing the task into its “constituent” parts. In addition, we see that infusing knowledge from reasoning datasets (e.g. ATOMIC2020) improves the performance even further. Moreover, we observe that the trained models generalize to out-of-distribution datasets, showing the superiority of task decomposition and knowledge infusion compared to previously used methods. Concretely, our method outperforms the baseline by 17.83% absolute gain in the 16-shot case.

Large-scale generative language models such as GPT-3 are competitive few-shot learners. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual generative language models on a corpus covering a diverse set of languages, and study their few- and zero-shot learning capabilities in a wide range of tasks. Our largest model with 7.5 billion parameters sets new state of the art in few-shot learning in more than 20 representative languages, outperforming GPT-3 of comparable size in multilingual commonsense reasoning (with +7.4% absolute accuracy improvement in 0-shot settings and +9.4% in 4-shot settings) and natural language inference (+5.4% in each of 0-shot and 4-shot settings). On the FLORES-101 machine translation benchmark, our model outperforms GPT-3 on 171 out of 182 directions with 32 training examples, while surpassing the official supervised baseline in 45 directions. We conduct an in-depth analysis of different multilingual prompting approaches, showing in particular that strong few-shot learning performance across languages can be achieved via cross-lingual transfer through both templates and demonstration examples.

Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full-shot fine-tuning. With the exception of fine-tuning, we find MoEs to be substantially more compute efficient. At more modest training budgets, MoEs can match the performance of dense models using ~4 times less compute. This gap narrows at scale, but our largest MoE model (1.1T parameters) consistently outperforms a compute-equivalent dense model (6.7B parameters). Overall, this performance gap varies greatly across tasks and domains, suggesting that MoE and dense models generalize differently in ways that are worthy of future study. We make our code and models publicly available for research use.

pdf bib abs
Towards Responsible Natural Language Annotation for the Varieties of Arabic
A. Bergman | Mona Diab
Findings of the Association for Computational Linguistics: ACL 2022

When building NLP models, there is a tendency to aim for broader coverage, often overlooking cultural and (socio)linguistic nuance. In this position paper, we make the case for care and attention to such nuances, particularly in dataset annotation, as well as the inclusion of cultural and linguistic expertise in the process. We present a playbook for responsible dataset creation for polyglossic, multidialectal languages. This work is informed by a study on Arabic annotation of social media content.

pdf bib abs
A Quantitative and Qualitative Analysis of Schizophrenia Language
Amal Alqahtani | Efsun Sarioglu Kayi | Sardar Hamidian | Michael Compton | Mona Diab
Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI)

Schizophrenia is one of the most disabling mental health conditions to live with. Approximately one percent of the population has schizophrenia which makes it fairly common, and it affects many people and their families. Patients with schizophrenia suffer different symptoms: formal thought disorder (FTD), delusions, and emotional flatness. In this paper, we quantitatively and qualitatively analyze the language of patients with schizophrenia measuring various linguistic features in two modalities: speech and written text. We examine the following features: coherence and cohesion of thoughts, emotions, specificity, level of commit- ted belief (LCB), and personality traits. Our results show that patients with schizophrenia score high in fear and neuroticism compared to healthy controls. In addition, they are more committed to their beliefs, and their writing lacks details. They score lower in most of the linguistic features of cohesion with significant p-values.

We present the BeSt corpus, which records cognitive state: who believes what (i.e., factuality), and who has what sentiment towards what. This corpus is inspired by similar source-and-target corpora, specifically MPQA and FactBank. The corpus comprises two genres, newswire and discussion forums, in three languages, Chinese (Mandarin), English, and Spanish. The corpus is distributed through the LDC.

pdf bib abs
AnswerSumm: A Manually-Curated Dataset and Pipeline for Answer Summarization
Alexander Fabbri | Xiaojian Wu | Srini Iyer | Haoran Li | Mona Diab
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Community Question Answering (CQA) fora such as Stack Overflow and Yahoo! Answers contain a rich resource of answers to a wide range of community-based questions. Each question thread can receive a large number of answers with different perspectives. One goal of answer summarization is to produce a summary that reflects the range of answer perspectives. A major obstacle for this task is the absence of a dataset to provide supervision for producing such summaries. Recent works propose heuristics to create such data, but these are often noisy and do not cover all answer perspectives present. This work introduces a novel dataset of 4,631 CQA threads for answer summarization curated by professional linguists. Our pipeline gathers annotations for all subtasks of answer summarization, including relevant answer sentence selection, grouping these sentences based on perspectives, summarizing each perspective, and producing an overall summary. We analyze and benchmark state-of-the-art models on these subtasks and introduce a novel unsupervised approach for multi-perspective data augmentation that boosts summarization performance according to automatic evaluation. Finally, we propose reinforcement learning rewards to improve factual consistency and answer coverage and analyze areas for improvement.

pdf bib abs
Meta AI at Arabic Hate Speech 2022: MultiTask Learning with Self-Correction for Hate Speech Classification
Badr AlKhamissi | Mona Diab
Proceedinsg of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection

In this paper, we tackle the Arabic Fine-Grained Hate Speech Detection shared task and demonstrate significant improvements over reported baselines for its three subtasks. The tasks are to predict if a tweet contains (1) Offensive language; and whether it is considered (2) Hate Speech or not and if so, then predict the (3) Fine-Grained Hate Speech label from one of six categories. Our final solution is an ensemble of models that employs multitask learning and a self-consistency correction method yielding 82.7% on the hate speech subtask—reflecting a 3.4% relative improvement compared to previous work.

pdf bib abs
GisPy: A Tool for Measuring Gist Inference Score in Text
Pedram Hosseini | Christopher Wolfe | Mona Diab | David Broniatowski
Proceedings of the 4th Workshop of Narrative Understanding (WNU2022)

Decision making theories such as Fuzzy-Trace Theory (FTT) suggest that individuals tend to rely on gist, or bottom-line meaning, in the text when making decisions. In this work, we delineate the process of developing GisPy, an opensource tool in Python for measuring the Gist Inference Score (GIS) in text. Evaluation of GisPy on documents in three benchmarks from the news and scientific text domains demonstrates that scores generated by our tool significantly distinguish low vs. high gist documents. Our tool is publicly available to use at: https: //github.com/phosseini/GisPy.

2021

pdf bib abs
Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data
Wei-Jen Ko | Ahmed El-Kishky | Adithya Renduchintala | Vishrav Chaudhary | Naman Goyal | Francisco Guzmán | Pascale Fung | Philipp Koehn | Mona Diab
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages. Fortunately, some low-resource languages are linguistically related or similar to high-resource languages; these related languages may share many lexical or syntactic structures. In this work, we exploit this linguistic overlap to facilitate translating to and from a low-resource language with only monolingual data, in addition to any parallel data in the related high-resource language. Our method, NMT-Adapt, combines denoising autoencoding, back-translation and adversarial objectives to utilize monolingual data for low-resource adaptation. We experiment on 7 languages from three different language families and show that our technique significantly improves translation into low-resource language compared to other translation baselines.

pdf bib abs
Gender bias amplification during Speed-Quality optimization in Neural Machine Translation
Adithya Renduchintala | Denise Diaz | Kenneth Heafield | Xian Li | Mona Diab
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Is bias amplified when neural machine translation (NMT) models are optimized for speed and evaluated on generic test sets using BLEU? We investigate architectures and techniques commonly used to speed up decoding in Transformer-based models, such as greedy search, quantization, average attention networks (AANs) and shallow decoder models and show their effect on gendered noun translation. We construct a new gender bias test set, SimpleGEN, based on gendered noun phrases in which there is a single, unambiguous, correct answer. While we find minimal overall BLEU degradation as we apply speed optimizations, we observe that gendered noun translation performance degrades at a much faster rate.

pdf bib abs
Discrete Cosine Transform as Universal Sentence Encoder
Nada Almarwani | Mona Diab
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Modern sentence encoders are used to generate dense vector representations that capture the underlying linguistic characteristics for a sequence of words, including phrases, sentences, or paragraphs. These kinds of representations are ideal for training a classifier for an end task such as sentiment analysis, question answering and text classification. Different models have been proposed to efficiently generate general purpose sentence representations to be used in pretraining protocols. While averaging is the most commonly used efficient sentence encoder, Discrete Cosine Transform (DCT) was recently proposed as an alternative that captures the underlying syntactic characteristics of a given text without compromising practical efficiency compared to averaging. However, as with most other sentence encoders, the DCT sentence encoder was only evaluated in English. To this end, we utilize DCT encoder to generate universal sentence representation for different languages such as German, French, Spanish and Russian. The experimental results clearly show the superior effectiveness of DCT encoding in which consistent performance improvements are achieved over strong baselines on multiple standardized datasets

pdf bib abs
Active Learning for Rumor Identification on Social Media
Parsa Farinneya | Mohammad Mahdi Abdollah Pour | Sardar Hamidian | Mona Diab
Findings of the Association for Computational Linguistics: EMNLP 2021

Social media has emerged as a key channel for seeking information. Online users spend several hours reading, posting, and searching for news on microblogging platforms daily. However, this could act as a double-edged sword especially when not all information online is reliable. Moreover, the inherently unmoderated nature of social media renders identifying unverified information ever more challenging. Most of the existing approaches for rumor tracking are not scalable because of their dependency on a significant amount of labeled data. In this work, we investigate this problem from different angles. We design an Active-Transfer Learning (ATL) strategy to identify rumors with a limited amount of annotated data. We go beyond that and investigate the impact of leveraging various machine learning approaches in addition to different contextual representations. We discuss the impact of multiple classifiers on a limited amount of annotated data followed by an interactive approach to gradually update the models by adding the least certain samples (LCS) from the pool of unlabeled data. Our proposed Active Learning (AL) strategy achieves faster convergence in terms of the F-score while requiring fewer annotated samples (42% of the whole dataset for the best model).

2020

pdf bib abs
FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization
Esin Durmus | He He | Mona Diab
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Neural abstractive summarization models are prone to generate content inconsistent with the source document, i.e. unfaithful. Existing automatic metrics do not capture such mistakes effectively. We tackle the problem of evaluating faithfulness of a generated summary given its source document. We first collected human annotations of faithfulness for outputs from numerous models on two datasets. We find that current models exhibit a trade-off between abstractiveness and faithfulness: outputs with less word overlap with the source document are more likely to be unfaithful. Next, we propose an automatic question answering (QA) based metric for faithfulness, FEQA, which leverages recent advances in reading comprehension. Given question-answer pairs generated from the summary, a QA model extracts answers from the document; non-matched answers indicate unfaithful information in the summary. Among metrics based on word overlap, embedding similarity, and learned language understanding models, our QA-based metric has significantly higher correlation with human faithfulness scores, especially on highly abstractive summaries.

pdf bib abs
A Multitask Learning Approach for Diacritic Restoration
Sawsan Alqahtani | Ajay Mishra | Mona Diab
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In many languages like Arabic, diacritics are used to specify pronunciations as well as meanings. Such diacritics are often omitted in written text, increasing the number of possible pronunciations and meanings for a word. This results in a more ambiguous text making computational processing on such text more difficult. Diacritic restoration is the task of restoring missing diacritics in the written text. Most state-of-the-art diacritic restoration models are built on character level information which helps generalize the model to unseen data, but presumably lose useful information at the word level. Thus, to compensate for this loss, we investigate the use of multi-task learning to jointly optimize diacritic restoration with related NLP problems namely word segmentation, part-of-speech tagging, and syntactic diacritization. We use Arabic as a case study since it has sufficient data resources for tasks that we consider in our joint modeling. Our joint models significantly outperform the baselines and are comparable to the state-of-the-art models that are more complex relying on morphological analyzers and/or a lot more data (e.g. dialectal data).

pdf bib abs
DeSePtion: Dual Sequence Prediction and Adversarial Examples for Improved Fact-Checking
Christopher Hidey | Tuhin Chakrabarty | Tariq Alhindi | Siddharth Varia | Kriste Krstovski | Mona Diab | Smaranda Muresan
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The increased focus on misinformation has spurred development of data and systems for detecting the veracity of a claim as well as retrieving authoritative evidence. The Fact Extraction and VERification (FEVER) dataset provides such a resource for evaluating endto- end fact-checking, requiring retrieval of evidence from Wikipedia to validate a veracity prediction. We show that current systems for FEVER are vulnerable to three categories of realistic challenges for fact-checking – multiple propositions, temporal reasoning, and ambiguity and lexical variation – and introduce a resource with these types of claims. Then we present a system designed to be resilient to these “attacks” using multiple pointer networks for document selection and jointly modeling a sequence of evidence sentences and veracity relation predictions. We find that in handling these attacks we obtain state-of-the-art results on FEVER, largely due to improved evidence retrieval.

pdf bib
Proceedings of the 4th Workshop on Computational Approaches to Code Switching
Thamar Solorio | Monojit Choudhury | Kalika Bali | Sunayana Sitaram | Amitava Das | Mona Diab
Proceedings of the 4th Workshop on Computational Approaches to Code Switching

pdf bib abs
Detecting Urgency Status of Crisis Tweets: A Transfer Learning Approach for Low Resource Languages
Efsun Sarioglu Kayi | Linyong Nan | Bohan Qu | Mona Diab | Kathleen McKeown
Proceedings of the 28th International Conference on Computational Linguistics

We release an urgency dataset that consists of English tweets relating to natural crises, along with annotations of their corresponding urgency status. Additionally, we release evaluation datasets for two low-resource languages, i.e. Sinhala and Odia, and demonstrate an effective zero-shot transfer from English to these two languages by training cross-lingual classifiers. We adopt cross-lingual embeddings constructed using different methods to extract features of the tweets, including a few state-of-the-art contextual embeddings such as BERT, RoBERTa and XLM-R. We train classifiers of different architectures on the extracted features. We also explore semi-supervised approaches by utilizing unlabeled tweets and experiment with ensembling different classifiers. With very limited amounts of labeled data in English and zero data in the low resource languages, we show a successful framework of training monolingual and cross-lingual classifiers using deep learning methods which are known to be data hungry. Specifically, we show that the recent deep contextual embeddings are also helpful when dealing with very small-scale datasets. Classifiers that incorporate RoBERTa yield the best performance for English urgency detection task, with F1 scores that are more than 25 points over our baseline classifier. For the zero-shot transfer to low resource languages, classifiers that use LASER features perform the best for Sinhala transfer while XLM-R features benefit the Odia transfer the most.

pdf bib abs
Multitask Learning for Cross-Lingual Transfer of Broad-coverage Semantic Dependencies
Maryam Aminian | Mohammad Sadegh Rasooli | Mona Diab
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We describe a method for developing broad-coverage semantic dependency parsers for languages for which no semantically annotated resource is available. We leverage a multitask learning framework coupled with annotation projection. We use syntactic parsing as the auxiliary task in our multitask setup. Our annotation projection experiments from English to Czech show that our multitask setup yields 3.1% (4.2%) improvement in labeled F1-score on in-domain (out-of-domain) test set compared to a single-task baseline.

pdf bib abs
Diversity, Density, and Homogeneity: Quantitative Characteristic Metrics for Text Collections
Yi-An Lai | Xuan Zhu | Yi Zhang | Mona Diab
Proceedings of the Twelfth Language Resources and Evaluation Conference

Summarizing data samples by quantitative measures has a long history, with descriptive statistics being a case in point. However, as natural language processing methods flourish, there are still insufficient characteristic metrics to describe a collection of texts in terms of the words, sentences, or paragraphs they comprise. In this work, we propose metrics of diversity, density, and homogeneity that quantitatively measure the dispersion, sparsity, and uniformity of a text collection. We conduct a series of simulations to verify that each metric holds desired properties and resonates with human intuitions. Experiments on real-world datasets demonstrate that the proposed characteristic metrics are highly correlated with text classification performance of a renowned model, BERT, which could inspire future applications.

pdf bib abs
Learning to Classify Intents and Slot Labels Given a Handful of Examples
Jason Krone | Yi Zhang | Mona Diab
Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI

Intent classification (IC) and slot filling (SF) are core components in most goal-oriented dialogue systems. Current IC/SF models perform poorly when the number of training examples per class is small. We propose a new few-shot learning task, few-shot IC/SF, to study and improve the performance of IC and SF models on classes not seen at training time in ultra low resource scenarios. We establish a few-shot IC/SF benchmark by defining few-shot splits for three public IC/SF datasets, ATIS, TOP, and Snips. We show that two popular few-shot learning algorithms, model agnostic meta learning (MAML) and prototypical networks, outperform a fine-tuning baseline on this benchmark. Prototypical networks achieves significant gains in IC performance on the ATIS and TOP datasets, while both prototypical networks and MAML outperform the baseline with respect to SF on all three datasets. In addition, we demonstrate that joint training as well as the use of pre-trained language models, ELMo and BERT in our case, are complementary to these few-shot learning methods and yield further gains.

2019

pdf bib abs
CASA-NLU: Context-Aware Self-Attentive Natural Language Understanding for Task-Oriented Chatbots
Arshit Gupta | Peng Zhang | Garima Lalwani | Mona Diab
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Natural Language Understanding (NLU) is a core component of dialog systems. It typically involves two tasks - Intent Classification (IC) and Slot Labeling (SL), which are then followed by a dialogue management (DM) component. Such NLU systems cater to utterances in isolation, thus pushing the problem of context management to DM. However, contextual information is critical to the correct prediction of intents in a conversation. Prior work on contextual NLU has been limited in terms of the types of contextual signals used and the understanding of their impact on the model. In this work, we propose a context-aware self-attentive NLU (CASA-NLU) model that uses multiple signals over a variable context window, such as previous intents, slots, dialog acts and utterances, in addition to the current user utterance. CASA-NLU outperforms a recurrent contextual NLU baseline on two conversational datasets, yielding a gain of up to 7% on the IC task. Moreover, a non-contextual variant of CASA-NLU achieves state-of-the-art performance on standard public datasets - SNIPS and ATIS.

pdf bib abs
Efficient Convolutional Neural Networks for Diacritic Restoration
Sawsan Alqahtani | Ajay Mishra | Mona Diab
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Diacritic restoration has gained importance with the growing need for machines to understand written texts. The task is typically modeled as a sequence labeling problem and currently Bidirectional Long Short Term Memory (BiLSTM) models provide state-of-the-art results. Recently, Bai et al. (2018) show the advantages of Temporal Convolutional Neural Networks (TCN) over Recurrent Neural Networks (RNN) for sequence modeling in terms of performance and computational resources. As diacritic restoration benefits from both previous as well as subsequent timesteps, we further apply and evaluate a variant of TCN, Acausal TCN (A-TCN), which incorporates context from both directions (previous and future) rather than strictly incorporating previous context as in the case of TCN. A-TCN yields significant improvement over TCN for diacritization in three different languages: Arabic, Yoruba, and Vietnamese. Furthermore, A-TCN and BiLSTM have comparable performance, making A-TCN an efficient alternative over BiLSTM since convolutions can be trained in parallel. A-TCN is significantly faster than BiLSTM at inference time (270% 334% improvement in the amount of text diacritized per minute).

pdf bib abs
Efficient Sentence Embedding using Discrete Cosine Transform
Nada Almarwani | Hanan Aldarmaki | Mona Diab
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Vector averaging remains one of the most popular sentence embedding methods in spite of its obvious disregard for syntactic structure. While more complex sequential or convolutional networks potentially yield superior classification performance, the improvements in classification accuracy are typically mediocre compared to the simple vector averaging. As an efficient alternative, we propose the use of discrete cosine transform (DCT) to compress word sequences in an order-preserving manner. The lower order DCT coefficients represent the overall feature patterns in sentences, which results in suitable embeddings for tasks that could benefit from syntactic features. Our results in semantic probing tasks demonstrate that DCT embeddings indeed preserve more syntactic information compared with vector averaging. With practically equivalent complexity, the model yields better overall performance in downstream classification tasks that correlate with syntactic features, which illustrates the capacity of DCT to preserve word order information.

pdf bib abs
Multi-Domain Goal-Oriented Dialogues (MultiDoGO): Strategies toward Curating and Annotating Large Scale Dialogue Data
Denis Peskov | Nancy Clarke | Jason Krone | Brigi Fodor | Yi Zhang | Adel Youssef | Mona Diab
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

The need for high-quality, large-scale, goal-oriented dialogue datasets continues to grow as virtual assistants become increasingly wide-spread. However, publicly available datasets useful for this area are limited either in their size, linguistic diversity, domain coverage, or annotation granularity. In this paper, we present strategies toward curating and annotating large scale goal oriented dialogue data. We introduce the MultiDoGO dataset to overcome these limitations. With a total of over 81K dialogues harvested across six domains, MultiDoGO is over 8 times the size of MultiWOZ, the other largest comparable dialogue dataset currently available to the public. Over 54K of these harvested conversations are annotated for intent classes and slot labels. We adopt a Wizard-of-Oz approach wherein a crowd-sourced worker (the “customer”) is paired with a trained annotator (the “agent”). The data curation process was controlled via biases to ensure a diversity in dialogue flows following variable dialogue policies. We provide distinct class label tags for agents vs. customer utterances, along with applicable slot labels. We also compare and contrast our strategies on annotation granularity, i.e. turn vs. sentence level. Furthermore, we compare and contrast annotations curated by leveraging professional annotators vs the crowd. We believe our strategies for eliciting and annotating such a dialogue dataset scales across modalities and domains and potentially languages in the future. To demonstrate the efficacy of our devised strategies we establish neural baselines for classification on the agent and customer utterances as well as slot labeling for each domain.

pdf bib abs
Identifying Nuances in Fake News vs. Satire: Using Semantic and Linguistic Cues
Or Levi | Pedram Hosseini | Mona Diab | David Broniatowski
Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda

The blurry line between nefarious fake news and protected-speech satire has been a notorious struggle for social media platforms. Further to the efforts of reducing exposure to misinformation on social media, purveyors of fake news have begun to masquerade as satire sites to avoid being demoted. In this work, we address the challenge of automatically classifying fake news versus satire. Previous work have studied whether fake news and satire can be distinguished based on language differences. Contrary to fake news, satire stories are usually humorous and carry some political or social message. We hypothesize that these nuances could be identified using semantic and linguistic cues. Consequently, we train a machine learning method using semantic representation, with a state-of-the-art contextual language model, and with linguistic features based on textual coherence metrics. Empirical evaluation attests to the merits of our approach compared to the language-based baseline and sheds light on the nuances between fake news and satire. As avenues for future work, we consider studying additional linguistic features related to the humor aspect, and enriching the data with current news events, to help identify a political or social message.

pdf bib abs
Context-Aware Cross-Lingual Mapping
Hanan Aldarmaki | Mona Diab
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Cross-lingual word vectors are typically obtained by fitting an orthogonal matrix that maps the entries of a bilingual dictionary from a source to a target vector space. Word vectors, however, are most commonly used for sentence or document-level representations that are calculated as the weighted average of word embeddings. In this paper, we propose an alternative to word-level mapping that better reflects sentence-level cross-lingual similarity. We incorporate context in the transformation matrix by directly mapping the averaged embeddings of aligned sentences in a parallel corpus. We also implement cross-lingual mapping of deep contextualized word embeddings using parallel sentences with word alignments. In our experiments, both approaches resulted in cross-lingual sentence embeddings that outperformed context-independent word mapping in sentence translation retrieval. Furthermore, the sentence-level transformation could be used for word-level mapping without loss in word translation quality.

pdf bib abs
Scalable Cross-Lingual Transfer of Neural Sentence Embeddings
Hanan Aldarmaki | Mona Diab
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

We develop and investigate several cross-lingual alignment approaches for neural sentence embedding models, such as the supervised inference classifier, InferSent, and sequential encoder-decoder models. We evaluate three alignment frameworks applied to these models: joint modeling, representation transfer learning, and sentence mapping, using parallel text to guide the alignment. Our results support representation transfer as a scalable approach for modular cross-lingual alignment of neural sentence embeddings, where we observe better performance compared to joint models in intrinsic and extrinsic evaluations, particularly with smaller sets of parallel data.

pdf bib abs
GWU NLP Lab at SemEval-2019 Task 3 : EmoContext: Effectiveness ofContextual Information in Models for Emotion Detection inSentence-level at Multi-genre Corpus
Shabnam Tafreshi | Mona Diab
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper we present an emotion classifier models that submitted to the SemEval-2019 Task 3 : EmoContext. Our approach is a Gated Recurrent Neural Network (GRU) model with attention layer is bootstrapped with contextual information and trained with a multigenre corpus, which is combination of several popular emotional data sets. We utilize different word embeddings to empirically select the most suited embedding to represent our features. Our aim is to build a robust emotion classifier that can generalize emotion detection, which is to learn emotion cues in a noisy training environment. To fulfill this aim we train our model with a multigenre emotion corpus, this way we leverage from having more training set. We achieved overall %56.05 f1-score and placed 144. Given our aim and noisy training environment, the results are anticipated.

pdf bib abs
GWU NLP at SemEval-2019 Task 7: Hybrid Pipeline for Rumour Veracity and Stance Classification on Social Media
Sardar Hamidian | Mona Diab
Proceedings of the 13th International Workshop on Semantic Evaluation

Social media plays a crucial role as the main resource news for information seekers online. However, the unmoderated feature of social media platforms lead to the emergence and spread of untrustworthy contents which harm individuals or even societies. Most of the current automated approaches for automatically determining the veracity of a rumor are not generalizable for novel emerging topics. This paper describes our hybrid system comprising rules and a machine learning model which makes use of replied tweets to identify the veracity of the source tweet. The proposed system in this paper achieved 0.435 F-Macro in stance classification, and 0.262 F-macro and 0.801 RMSE in rumor verification tasks in Task7 of SemEval 2019.

pdf bib abs
Cross-Lingual Transfer of Semantic Roles: From Raw Text to Semantic Roles
Maryam Aminian | Mohammad Sadegh Rasooli | Mona Diab
Proceedings of the 13th International Conference on Computational Semantics - Long Papers

We describe a transfer method based on annotation projection to develop a dependency-based semantic role labeling system for languages for which no supervised linguistic information other than parallel data is available. Unlike previous work that presumes the availability of supervised features such as lemmas, part-of-speech tags, and dependency parse trees, we only make use of word and character features. Our deep model considers using character-based representations as well as unsupervised stem embeddings to alleviate the need for supervised features. Our experiments outperform a state-of-the-art method that uses supervised lexico-syntactic features on 6 out of 7 languages in the Universal Proposition Bank.

pdf bib abs
Leveraging Pretrained Word Embeddings for Part-of-Speech Tagging of Code Switching Data
Fahad AlGhamdi | Mona Diab
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

Linguistic Code Switching (CS) is a phenomenon that occurs when multilingual speakers alternate between two or more languages/dialects within a single conversation. Processing CS data is especially challenging in intra-sentential data given state-of-the-art monolingual NLP technologies since such technologies are geared toward the processing of one language at a time. In this paper, we address the problem of Part-of-Speech tagging (POS) in the context of linguistic code switching (CS). We explore leveraging multiple neural network architectures to measure the impact of different pre-trained embeddings methods on POS tagging CS data. We investigate the landscape in four CS language pairs, Spanish-English, Hindi-English, Modern Standard Arabic- Egyptian Arabic dialect (MSA-EGY), and Modern Standard Arabic- Levantine Arabic dialect (MSA-LEV). Our results show that multilingual embedding (e.g., MSA-EGY and MSA-LEV) helps closely related languages (EGY/LEV) but adds noise to the languages that are distant (SPA/HIN). Finally, we show that our proposed models outperform state-of-the-art CS taggers for MSA-EGY language pair.

pdf bib abs
Homograph Disambiguation through Selective Diacritic Restoration
Sawsan Alqahtani | Hanan Aldarmaki | Mona Diab
Proceedings of the Fourth Arabic Natural Language Processing Workshop

Lexical ambiguity, a challenging phenomenon in all natural languages, is particularly prevalent for languages with diacritics that tend to be omitted in writing, such as Arabic. Omitting diacritics leads to an increase in the number of homographs: different words with the same spelling. Diacritic restoration could theoretically help disambiguate these words, but in practice, the increase in overall sparsity leads to performance degradation in NLP applications. In this paper, we propose approaches for automatically marking a subset of words for diacritic restoration, which leads to selective homograph disambiguation. Compared to full or no diacritic restoration, these approaches yield selectively-diacritized datasets that balance sparsity and lexical disambiguation. We evaluate the various selection strategies extrinsically on several downstream applications: neural machine translation, part-of-speech tagging, and semantic textual similarity. Our experiments on Arabic show promising results, where our devised strategies on selective diacritization lead to a more balanced and consistent performance in downstream applications.

2018

pdf bib abs
Evaluation of Unsupervised Compositional Representations
Hanan Aldarmaki | Mona Diab
Proceedings of the 27th International Conference on Computational Linguistics

We evaluated various compositional models, from bag-of-words representations to compositional RNN-based models, on several extrinsic supervised and unsupervised evaluation benchmarks. Our results confirm that weighted vector averaging can outperform context-sensitive models in most benchmarks, but structural features encoded in RNN models can also be useful in certain classification tasks. We analyzed some of the evaluation datasets to identify the aspects of meaning they measure and the characteristics of the various models that explain their performance variance.

pdf bib abs
Emotion Detection and Classification in a Multigenre Corpus with Joint Multi-Task Deep Learning
Shabnam Tafreshi | Mona Diab
Proceedings of the 27th International Conference on Computational Linguistics

Detection and classification of emotion categories expressed by a sentence is a challenging task due to subjectivity of emotion. To date, most of the models are trained and evaluated on single genre and when used to predict emotion in different genre their performance drops by a large margin. To address the issue of robustness, we model the problem within a joint multi-task learning framework. We train this model with a multigenre emotion corpus to predict emotions across various genre. Each genre is represented as a separate task, we use soft parameter shared layers across the various tasks. our experimental results show that this model improves the results across the various genres, compared to a single genre training in the same neural net architecture.

pdf bib
WASA: A Web Application for Sequence Annotation
Fahad AlGhamdi | Mona Diab
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Sentence and Clause Level Emotion Annotation, Detection, and Classification in a Multi-Genre Corpus
Shabnam Tafreshi | Mona Diab
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib abs
Unsupervised Word Mapping Using Structural Similarities in Monolingual Embeddings
Hanan Aldarmaki | Mahesh Mohan | Mona Diab
Transactions of the Association for Computational Linguistics, Volume 6

Most existing methods for automatic bilingual dictionary induction rely on prior alignments between the source and target languages, such as parallel corpora or seed dictionaries. For many language pairs, such supervised alignments are not readily available. We propose an unsupervised approach for learning a bilingual dictionary for a pair of languages given their independently-learned monolingual word embeddings. The proposed method exploits local and global structures in monolingual vector spaces to align them such that similar words are mapped to each other. We show empirically that the performance of bilingual correspondents that are learned using our proposed unsupervised method is comparable to that of using supervised bilingual correspondents from a seed dictionary.

pdf bib
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching
Gustavo Aguilar | Fahad AlGhamdi | Victor Soto | Thamar Solorio | Mona Diab | Julia Hirschberg
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

pdf bib abs
Named Entity Recognition on Code-Switched Data: Overview of the CALCS 2018 Shared Task
Gustavo Aguilar | Fahad AlGhamdi | Victor Soto | Mona Diab | Julia Hirschberg | Thamar Solorio
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

In the third shared task of the Computational Approaches to Linguistic Code-Switching (CALCS) workshop, we focus on Named Entity Recognition (NER) on code-switched social-media data. We divide the shared task into two competitions based on the English-Spanish (ENG-SPA) and Modern Standard Arabic-Egyptian (MSA-EGY) language pairs. We use Twitter data and 9 entity types to establish a new dataset for code-switched NER benchmarks. In addition to the CS phenomenon, the diversity of the entities and the social media challenges make the task considerably hard to process. As a result, the best scores of the competitions are 63.76% and 71.61% for ENG-SPA and MSA-EGY, respectively. We present the scores of 9 participants and discuss the most common challenges among submissions.

pdf bib abs
Team SWEEPer: Joint Sentence Extraction and Fact Checking with Pointer Networks
Christopher Hidey | Mona Diab
Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)

Many tasks such as question answering and reading comprehension rely on information extracted from unreliable sources. These systems would thus benefit from knowing whether a statement from an unreliable source is correct. We present experiments on the FEVER (Fact Extraction and VERification) task, a shared task that involves selecting sentences from Wikipedia and predicting whether a claim is supported by those sentences, refuted, or there is not enough information. Fact checking is a task that benefits from not only asserting or disputing the veracity of a claim but also finding evidence for that position. As these tasks are dependent on each other, an ideal model would consider the veracity of the claim when finding evidence and also find only the evidence that is relevant. We thus jointly model sentence extraction and verification on the FEVER shared task. Among all participants, we ranked 5th on the blind test set (prior to any additional human evaluation of the evidence).

2017

pdf bib
Traitement Automatique des Langues, Volume 58, Numéro 3 : Traitement automatique de l'arabe et des langues apparentées [NLP for Arabic and Related Languages]
Mona Diab | Nizar Habash | Imed Zitouni
Traitement Automatique des Langues, Volume 58, Numéro 3 : Traitement automatique de l'arabe et des langues apparentées [NLP for Arabic and Related Languages]

pdf bib
NLP for Arabic and Related Languages
Mona Diab | Nizar Habash | Imed Zitouni
Traitement Automatique des Langues, Volume 58, Numéro 3 : Traitement automatique de l'arabe et des langues apparentées [NLP for Arabic and Related Languages]

pdf bib abs
Transferring Semantic Roles Using Translation and Syntactic Information
Maryam Aminian | Mohammad Sadegh Rasooli | Mona Diab
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Our paper addresses the problem of annotation projection for semantic role labeling for resource-poor languages using supervised annotations from a resource-rich language through parallel data. We propose a transfer method that employs information from source and target syntactic dependencies as well as word alignment density to improve the quality of an iterative bootstrapping method. Our experiments yield a 3.5 absolute labeled F-score improvement over a standard annotation projection method.

pdf bib abs
Predictive Linguistic Features of Schizophrenia
Efsun Sarioglu Kayi | Mona Diab | Luca Pauselli | Michael Compton | Glen Coppersmith
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)

Schizophrenia is one of the most disabling and difficult to treat of all human medical/health conditions, ranking in the top ten causes of disability worldwide. It has been a puzzle in part due to difficulty in identifying its basic, fundamental components. Several studies have shown that some manifestations of schizophrenia (e.g., the negative symptoms that include blunting of speech prosody, as well as the disorganization symptoms that lead to disordered language) can be understood from the perspective of linguistics. However, schizophrenia research has not kept pace with technologies in computational linguistics, especially in semantics and pragmatics. As such, we examine the writings of schizophrenia patients analyzing their syntax, semantics and pragmatics. In addition, we analyze tweets of (self proclaimed) schizophrenia patients who publicly discuss their diagnoses. For writing samples dataset, syntactic features are found to be the most successful in classification whereas for the less structured Twitter dataset, a combination of features performed the best.

pdf bib abs
SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation
Daniel Cer | Mona Diab | Eneko Agirre | Iñigo Lopez-Gazpio | Lucia Specia
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

Semantic Textual Similarity (STS) measures the meaning similarity of sentences. Applications include machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. The STS shared task is a venue for assessing the current state-of-the-art. The 2017 task focuses on multilingual and cross-lingual pairs with one sub-track exploring MT quality estimation (MTQE) data. The task obtained strong participation from 31 teams, with 17 participating in all language tracks. We summarize performance and review a selection of well performing methods. Analysis highlights common errors, providing insight into the limitations of existing models. To support ongoing work on semantic representations, the STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017).

pdf bib abs
GW_QA at SemEval-2017 Task 3: Question Answer Re-ranking on Arabic Fora
Nada Almarwani | Mona Diab
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes our submission to SemEval-2017 Task 3 Subtask D, “Question Answer Ranking in Arabic Community Question Answering”. In this work, we applied a supervised machine learning approach to automatically re-rank a set of QA pairs according to their relevance to a given question. We employ features based on latent semantic models, namely WTMF, as well as a set of lexical features based on string lengths and surface level matching. The proposed system ranked first out of 3 submissions, with a MAP score of 61.16%.

pdf bib abs
A Layered Language Model based Hybrid Approach to Automatic Full Diacritization of Arabic
Mohamed Al-Badrashiny | Abdelati Hawwari | Mona Diab
Proceedings of the Third Arabic Natural Language Processing Workshop

In this paper we present a system for automatic Arabic text diacritization using three levels of analysis granularity in a layered back off manner. We build and exploit diacritized language models (LM) for each of three different levels of granularity: surface form, morphologically segmented into prefix/stem/suffix, and character level. For each of the passes, we use Viterbi search to pick the most probable diacritization per word in the input. We start with the surface form LM, followed by the morphological level, then finally we leverage the character level LM. Our system outperforms all of the published systems evaluated against the same training and test data. It achieves a 10.87% WER for complete full diacritization including lexical and syntactic diacritization, and 3.0% WER for lexical diacritization, ignoring syntactic diacritization.

pdf bib abs
Arabic Textual Entailment with Word Embeddings
Nada Almarwani | Mona Diab
Proceedings of the Third Arabic Natural Language Processing Workshop

Determining the textual entailment between texts is important in many NLP tasks, such as summarization, question answering, and information extraction and retrieval. Various methods have been suggested based on external knowledge sources; however, such resources are not always available in all languages and their acquisition is typically laborious and very costly. Distributional word representations such as word embeddings learned over large corpora have been shown to capture syntactic and semantic word relationships. Such models have contributed to improving the performance of several NLP tasks. In this paper, we address the problem of textual entailment in Arabic. We employ both traditional features and distributional representations. Crucially, we do not depend on any external resources in the process. Our suggested approach yields state of the art performance on a standard data set, ArbTE, achieving an accuracy of 76.2 % compared to state of the art of 69.3 %.

2016

pdf bib abs
Investigating the Impact of Various Partial Diacritization Schemes on Arabic-English Statistical Machine Translation
Sawsan Alqahtani | Mahmoud Ghoneim | Mona Diab
Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track

Most diacritics in Arabic represent short vowels. In Arabic orthography, such diacritics are considered optional. The absence of these diacritics naturally leads to significant word ambiguity to top the inherent ambiguity present in fully diacritized words. Word ambiguity is a significant impediment for machine translation. Despite the ambiguity presented by lack of diacritization, context helps ameliorate the situation. Identifying the appropriate amount of diacritic restoration to reduce word sense ambiguity in the context of machine translation is the object of this paper. Diacritic marks help reduce the number of possible lexical word choices assigned to a source word which leads to better quality translated sentences. We investigate a variety of (linguistically motivated) partial diacritization schemes that preserve some of the semantics that in essence complement the implicit contextual information present in the sentences. We also study the effect of training data size and report results on three standard test sets that represent a combination of different genres. The results show statistically significant improvements for some schemes compared to two baselines: text with no diacritics (the typical writing system adopted for Arabic) and text that is fully diacritized.

pdf bib abs
LILI: A Simple Language Independent Approach for Language Identification
Mohamed Al-Badrashiny | Mona Diab
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We introduce a generic Language Independent Framework for Linguistic Code Switch Point Detection. The system uses characters level 5-grams and word level unigram language models to train a conditional random fields (CRF) model for classifying input words into various languages. We test our proposed framework and compare it to the state-of-the-art published systems on standard data sets from several language pairs: English-Spanish, Nepali-English, English-Hindi, Arabizi (Refers to Arabic written using the Latin/Roman script)-English, Arabic-Engari (Refers to English written using Arabic script), Modern Standard Arabic(MSA)-Egyptian, Levantine-MSA, Gulf-MSA, one more English-Spanish, and one more MSA-EGY. The overall weighted average F-score of each language pair are 96.4%, 97.3%, 98.0%, 97.0%, 98.9%, 86.3%, 88.2%, 90.6%, 95.2%, and 85.0% respectively. The results show that our approach despite its simplicity, either outperforms or performs at comparable levels to state-of-the-art published systems.

pdf bib abs
Explicit Fine grained Syntactic and Semantic Annotation of the Idafa Construction in Arabic
Abdelati Hawwari | Mohammed Attia | Mahmoud Ghoneim | Mona Diab
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Idafa in traditional Arabic grammar is an umbrella construction that covers several phenomena including what is expressed in English as noun-noun compounds and Saxon and Norman genitives. Additionally, Idafa participates in some other constructions, such as quantifiers, quasi-prepositions, and adjectives. Identifying the various types of the Idafa construction (IC) is of importance to Natural Language processing (NLP) applications. Noun-Noun compounds exhibit special behavior in most languages impacting their semantic interpretation. Hence distinguishing them could have an impact on downstream NLP applications. The most comprehensive syntactic representation of the Arabic language is the LDC Arabic Treebank (ATB). In the ATB, ICs are not explicitly labeled and furthermore, there is no distinction between ICs of noun-noun relations and other traditional ICs. Hence, we devise a detailed syntactic and semantic typification process of the IC phenomenon in Arabic. We target the ATB as a platform for this classification. We render the ATB annotated with explicit IC labels but with the further semantic characterization which is useful for syntactic, semantic and cross language processing. Our typification of IC comprises 3 main syntactic IC types: FIC, GIC, and TIC, and they are further divided into 10 syntactic subclasses. The TIC group is further classified into semantic relations. We devise a method for automatic IC labeling and compare its yield against the CATiB treebank. Our evaluation shows that we achieve the same level of accuracy, but with the additional fine-grained classification into the various syntactic and semantic types.

This paper presents the annotation guidelines developed as part of an effort to create a large scale manually diacritized corpus for various Arabic text genres. The target size of the annotated corpus is 2 million words. We summarize the guidelines and describe issues encountered during the training of the annotators. We also discuss the challenges posed by the complexity of the Arabic language and how they are addressed. Finally, we present the diacritization annotation procedure and detail the quality of the resulting annotations.

Text preprocessing is an important and necessary task for all NLP applications. A simple variation in any preprocessing step may drastically affect the final results. Moreover replicability and comparability, as much as feasible, is one of the goals of our scientific enterprise, thus building systems that can ensure the consistency in our various pipelines would contribute significantly to our goals. The problem has become quite pronounced with the abundance of NLP tools becoming more and more available yet with different levels of specifications. In this paper, we present a dynamic unified preprocessing framework and tool, SPLIT, that is highly configurable based on user requirements which serves as a preprocessing tool for several tools at once. SPLIT aims to standardize the implementations of the most important preprocessing steps by allowing for a unified API that could be exchanged across different researchers to ensure complete transparency in replication. The user is able to select the required preprocessing tasks among a long list of preprocessing steps. The user is also able to specify the order of execution which in turn affects the final preprocessing output.

pdf bib abs
Creating a Large Multi-Layered Representational Repository of Linguistic Code Switched Arabic Data
Mona Diab | Mahmoud Ghoneim | Abdelati Hawwari | Fahad AlGhamdi | Nada AlMarwani | Mohamed Al-Badrashiny
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present our effort to create a large Multi-Layered representational repository of Linguistic Code-Switched Arabic data. The process involves developing clear annotation standards and Guidelines, streamlining the annotation process, and implementing quality control measures. We used two main protocols for annotation: in-lab gold annotations and crowd sourcing annotations. We developed a web-based annotation tool to facilitate the management of the annotation process. The current version of the repository contains a total of 886,252 tokens that are tagged into one of sixteen code-switching tags. The data exhibits code switching between Modern Standard Arabic and Egyptian Dialectal Arabic representing three data genres: Tweets, commentaries, and discussion fora. The overall Inter-Annotator Agreement is 93.1%.

pdf bib
CU-GWU Perspective at SemEval-2016 Task 6: Ideological Stance Detection in Informal Text
Heba Elfardy | Mona Diab
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib
GWU NLP at SemEval-2016 Shared Task 1: Matrix Factorization for Crosslingual STS
Hanan Aldarmaki | Mona Diab
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib
Rumor Identification and Belief Investigation on Twitter
Sardar Hamidian | Mona Diab
Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

pdf bib
Learning Cross-lingual Representations with Matrix Factorization
Hanan Aldarmaki | Mona Diab
Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP

pdf bib
Addressing Annotation Complexity: The Case of Annotating Ideological Perspective in Egyptian Social Media
Heba Elfardy | Mona Diab
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

Arabic writing is typically underspecified for short vowels and other markups, referred to as diacritics. In addition to the lexical ambiguity exhibited in most languages, the lack of diacritics in written Arabic adds another layer of ambiguity which is an artifact of the orthography. In this paper, we present the details of three annotation experimental conditions designed to study the impact of automatic ambiguity detection, on annotation speed and quality in a large scale annotation project.

pdf bib abs
The GW/LT3 VarDial 2016 Shared Task System for Dialects and Similar Languages Detection
Ayah Zirikly | Bart Desmet | Mona Diab
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

This paper describes the GW/LT3 contribution to the 2016 VarDial shared task on the identification of similar languages (task 1) and Arabic dialects (task 2). For both tasks, we experimented with Logistic Regression and Neural Network classifiers in isolation. Additionally, we implemented a cascaded classifier that consists of coarse and fine-grained classifiers (task 1) and a classifier ensemble with majority voting for task 2. The submitted systems obtained state-of-the art performance and ranked first for the evaluation on social media data (test sets B1 and B2 for task 1), with a maximum weighted F1 score of 91.94%.

pdf bib abs
Processing Dialectal Arabic: Exploiting Variability and Similarity to Overcome Challenges and Discover Opportunities
Mona Diab
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

We recently witnessed an exponential growth in dialectal Arabic usage in both textual data and speech recordings especially in social media. Processing such media is of great utility for all kinds of applications ranging from information extraction to social media analytics for political and commercial purposes to building decision support systems. Compared to other languages, Arabic, especially the informal variety, poses a significant challenge to natural language processing algorithms since it comprises multiple dialects, linguistic code switching, and a lack of standardized orthographies, to top its relatively complex morphology. Inherently, the problem of processing Arabic in the context of social media is the problem of how to handle resource poor languages. In this talk I will go over some of our insights to some of these problems and show how there is a silver lining where we can generalize some of our solutions to other low resource language contexts.

pdf bib abs
Automatic Verification and Augmentation of Multilingual Lexicons
Maryam Aminian | Mohamed Al-Badrashiny | Mona Diab
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

We present an approach for automatic verification and augmentation of multilingual lexica. We exploit existing parallel and monolingual corpora to extract multilingual correspondents via tri-angulation. We demonstrate the efficacy of our approach on two publicly available resources: Tharwa, a three-way lexicon comprising Dialectal Arabic, Modern Standard Arabic and English lemmas among other information (Diab et al., 2014); and BabelNet, a multilingual thesaurus comprising over 276 languages including Arabic variant entries (Navigli and Ponzetto, 2012). Our automated approach yields an F1-score of 71.71% in generating correct multilingual correspondents against gold Tharwa, and 54.46% against gold BabelNet without any human intervention.

pdf bib abs
The Power of Language Music: Arabic Lemmatization through Patterns
Mohammed Attia | Ayah Zirikly | Mona Diab
Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex - V)

The interaction between roots and patterns in Arabic has intrigued lexicographers and morphologists for centuries. While roots provide the consonantal building blocks, patterns provide the syllabic vocalic moulds. While roots provide abstract semantic classes, patterns realize these classes in specific instances. In this way both roots and patterns are indispensable for understanding the derivational, morphological and, to some extent, the cognitive aspects of the Arabic language. In this paper we perform lemmatization (a high-level lexical processing) without relying on a lookup dictionary. We use a hybrid approach that consists of a machine learning classifier to predict the lemma pattern for a given stem, and mapping rules to convert stems to their respective lemmas with the vocalization defined by the pattern.

pdf bib abs
SAMER: A Semi-Automatically Created Lexical Resource for Arabic Verbal Multiword Expressions Tokens Paradigm and their Morphosyntactic Features
Mohamed Al-Badrashiny | Abdelati Hawwari | Mahmoud Ghoneim | Mona Diab
Proceedings of the 12th Workshop on Asian Language Resources (ALR12)

Although MWE are relatively morphologically and syntactically fixed expressions, several types of flexibility can be observed in MWE, verbal MWE in particular. Identifying the degree of morphological and syntactic flexibility of MWE is very important for many Lexicographic and NLP tasks. Adding MWE variants/tokens to a dictionary resource requires characterizing the flexibility among other morphosyntactic features. Carrying out the task manually faces several challenges since it is a very laborious task time and effort wise, as well as it will suffer from coverage limitation. The problem is exacerbated in rich morphological languages where the average word in Arabic could have 12 possible inflection forms. Accordingly, in this paper we introduce a semi-automatic Arabic multiwords expressions resource (SAMER). We propose an automated method that identifies the morphological and syntactic flexibility of Arabic Verbal Multiword Expressions (AVMWE). All observed morphological variants and syntactic pattern alternations of an AVMWE are automatically acquired using large scale corpora. We look for three morphosyntactic aspects of AVMWE types investigating derivational and inflectional variations and syntactic templates, namely: 1) inflectional variation (inflectional paradigm) and calculating degree of flexibility; 2) derivational productivity; and 3) identifying and classifying the different syntactic types. We build a comprehensive list of AVMWE. Every token in the AVMWE list is lemmatized and tagged with POS information. We then search Arabic Gigaword and All ATBs for all possible flexible matches. For each AVMWE type we generate: a) a statistically ranked list of MWE-lexeme inflections and syntactic pattern alternations; b) An abstract syntactic template; and c) The most frequent form. Our technique is validated using a Golden MWE annotated list. The results shows that the quality of the generated resource is 80.04%.

pdf bib
Proceedings of the Second Workshop on Computational Approaches to Code Switching
Mona Diab | Pascale Fung | Mahmoud Ghoneim | Julia Hirschberg | Thamar Solorio
Proceedings of the Second Workshop on Computational Approaches to Code Switching

pdf bib
The George Washington University System for the Code-Switching Workshop Shared Task 2016
Mohamed Al-Badrashiny | Mona Diab
Proceedings of the Second Workshop on Computational Approaches to Code Switching

2015

pdf bib
AIDA2: A Hybrid Approach for Token and Sentence Level Dialect Identification in Arabic
Mohamed Al-Badrashiny | Heba Elfardy | Mona Diab
Proceedings of the Nineteenth Conference on Computational Natural Language Learning

pdf bib
Ideological Perspective Detection Using Semantic Features
Heba Elfardy | Mona Diab | Chris Callison-Burch
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

pdf bib
Unsupervised False Friend Disambiguation Using Contextual Word Clusters and Parallel Word Alignments
Maryam Aminian | Mahmoud Ghoneim | Mona Diab
Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf bib
Committed Belief Tagging on the Factbank and LU Corpora: A Comparative Study
Gregory Werner | Vinodkumar Prabhakaran | Mona Diab | Owen Rambow
Proceedings of the Second Workshop on Extra-Propositional Aspects of Meaning in Computational Semantics (ExProM 2015)

pdf bib
Named Entity Recognition for Arabic Social Media
Ayah Zirikly | Mona Diab
Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing

pdf bib
GWU-HASP-2015@QALB-2015 Shared Task: Priming Spelling Candidates with Probability
Mohammed Attia | Mohamed Al-Badrashiny | Mona Diab
Proceedings of the Second Workshop on Arabic Natural Language Processing

pdf bib
Robust Part-of-speech Tagging of Arabic Text
Hanan Aldarmaki | Mona Diab
Proceedings of the Second Workshop on Arabic Natural Language Processing

2014

pdf bib
Fast Tweet Retrieval with Compact Binary Codes
Weiwei Guo | Wei Liu | Mona Diab
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

bib abs
Natural Language Processing of Arabic and its Dialects
Mona Diab | Nizar Habash
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

This tutorial introduces the different challenges and current solutions to the automatic processing of Arabic and its dialects. The tutorial has two parts: First, we present a discussion of generic issues relevant to Arabic NLP and detail dialectal linguistic issues and the challenges they pose for NLP. In the second part, we review the state-of-the-art in Arabic processing covering several enabling technologies and applications, e.g., dialect identification, morphological processing (analysis, disambiguation, tokenization, POS tagging), parsing, and machine translation.

We introduce an electronic three-way lexicon, Tharwa, comprising Dialectal Arabic, Modern Standard Arabic and English correspondents. The paper focuses on Egyptian Arabic as the first pilot dialect for the resource, with plans to expand to other dialects of Arabic in later phases of the project. We describe Tharwas creation process and report on its current status. The lexical entries are augmented with various elements of linguistic information such as POS, gender, rationality, number, and root and pattern information. The lexicon is based on a compilation of information from both monolingual and bilingual existing resources such as paper dictionaries and electronic, corpus-based dictionaries. Multiple levels of quality checks are performed on the output of each step in the creation process. The importance of this lexicon lies in the fact that it is the first resource of its kind bridging multiple variants of Arabic with English. Furthermore, it is a wide coverage lexical resource containing over 73,000 Egyptian entries. Tharwa is publicly available. We believe it will have a significant impact on both Theoretical Linguistics as well as Computational Linguistics research.

In this paper, we present MADAMIRA, a system for morphological analysis and disambiguation of Arabic that combines some of the best aspects of two previously commonly used systems for Arabic processing, MADA (Habash and Rambow, 2005; Habash et al., 2009; Habash et al., 2013) and AMIRA (Diab et al., 2007). MADAMIRA improves upon the two systems with a more streamlined Java implementation that is more robust, portable, extensible, and is faster than its ancestors by more than an order of magnitude. We also discuss an online demo (see http://nlp.ldeo.columbia.edu/madamira/) that highlights these aspects.

pdf bib abs
SANA: A Large Scale Multi-Genre, Multi-Dialect Lexicon for Arabic Subjectivity and Sentiment Analysis
Muhammad Abdul-Mageed | Mona Diab
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The computational treatment of subjectivity and sentiment in natural language is usually significantly improved by applying features exploiting lexical resources where entries are tagged with semantic orientation (e.g., positive, negative values). In spite of the fair amount of work on Arabic sentiment analysis over the past few years (e.g., (Abbasi et al., 2008; Abdul-Mageed et al., 2014; Abdul-Mageed et al., 2012; Abdul-Mageed and Diab, 2012a; Abdul-Mageed and Diab, 2012b; Abdul-Mageed et al., 2011a; Abdul-Mageed and Diab, 2011)), the language remains under-resourced as to these polarity repositories compared to the English language. In this paper, we report efforts to build and present SANA, a large-scale, multi-genre, multi-dialect multi-lingual lexicon for the subjectivity and sentiment analysis of the Arabic language and dialects.

pdf bib
Sentence Level Dialect Identification for Machine Translation System Selection
Wael Salloum | Heba Elfardy | Linda Alamir-Salloum | Nizar Habash | Mona Diab
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
A Framework for the Classification and Annotation of Multiword Expressions in Dialectal Arabic
Abdelati Hawwari | Mohammed Attia | Mona Diab
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

pdf bib
Named Entity Recognition System for Dialectal Arabic
Ayah Zirikly | Mona Diab
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

pdf bib
GWU-HASP: Hybrid Arabic Spelling and Punctuation Corrector
Mohammed Attia | Mohamed Al-Badrashiny | Mona Diab
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

pdf bib
Proceedings of the First Workshop on Computational Approaches to Code Switching
Mona Diab | Julia Hirschberg | Pascale Fung | Thamar Solorio
Proceedings of the First Workshop on Computational Approaches to Code Switching

pdf bib
AIDA: Identifying Code Switching in Informal Arabic Text
Heba Elfardy | Mohamed Al-Badrashiny | Mona Diab
Proceedings of the First Workshop on Computational Approaches to Code Switching

pdf bib
Handling OOV Words in Dialectal Arabic to English Machine Translation
Maryam Aminian | Mahmoud Ghoneim | Mona Diab
Proceedings of the EMNLP’2014 Workshop on Language Technology for Closely Related Languages and Language Variants

2013

pdf bib
Multiword Expressions in the Context of Statistical Machine Translation
Mahmoud Ghoneim | Mona Diab
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Improving Lexical Semantics for Sentential Semantics: Modeling Selectional Preference and Similar Words in a Latent Variable Model
Weiwei Guo | Mona Diab
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media
Weiwei Guo | Hao Li | Heng Ji | Mona Diab
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Sentence Level Dialect Identification in Arabic
Heba Elfardy | Mona Diab
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Reranking with Linguistic and Semantic Features for Arabic Optical Character Recognition
Nadi Tomeh | Nizar Habash | Ryan Roth | Noura Farra | Pradeep Dasigi | Mona Diab
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Identifying Opinion Subgroups in Arabic Online Discussions
Amjad Abu-Jbara | Ben King | Mona Diab | Dragomir Radev
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
ASMA: A System for Automatic Segmentation and Morpho-Syntactic Disambiguation of Modern Standard Arabic
Muhammad Abdul-Mageed | Mona Diab | Sandra Kübler
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

pdf bib
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity
Mona Diab | Tim Baldwin | Marco Baroni
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

pdf bib
*SEM 2013 shared task: Semantic Textual Similarity
Eneko Agirre | Daniel Cer | Mona Diab | Aitor Gonzalez-Agirre | Weiwei Guo
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

pdf bib
Semantic Textual Similarity: past present and future
Mona Diab
Proceedings of the Joint Symposium on Semantic Processing. Textual Inference and Structures in Corpora

2012

pdf bib
Who’s (Really) the Boss? Perception of Situational Power in Written Interactions
Vinodkumar Prabhakaran | Owen Rambow | Mona Diab
Proceedings of COLING 2012

pdf bib
Token Level Identification of Linguistic Code Switching
Heba Elfardy | Mona Diab
Proceedings of COLING 2012: Posters

pdf bib abs
Conventional Orthography for Dialectal Arabic
Nizar Habash | Mona Diab | Owen Rambow
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Dialectal Arabic (DA) refers to the day-to-day vernaculars spoken in the Arab world. DA lives side-by-side with the official language, Modern Standard Arabic (MSA). DA differs from MSA on all levels of linguistic representation, from phonology and morphology to lexicon and syntax. Unlike MSA, DA has no standard orthography since there are no Arabic dialect academies, nor is there a large edited body of dialectal literature that follows the same spelling standard. In this paper, we present CODA, a conventional orthography for dialectal Arabic; it is designed primarily for the purpose of developing computational models of Arabic dialects. We explain the design principles of CODA and provide a detailed description of its guidelines as applied to Egyptian Arabic.

pdf bib abs
Simplified guidelines for the creation of Large Scale Dialectal Arabic Annotations
Heba Elfardy | Mona Diab
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The Arabic language is a collection of dialectal variants along with the standard form, Modern Standard Arabic (MSA). MSA is used in official Settings while the dialectal variants (DA) correspond to the native tongue of the Arabic speakers. Arabic speakers typically code switch between DA and MSA, which is reflected extensively in written online social media. Automatic processing such Arabic genre is very difficult for automated NLP tools since the linguistic difference between MSA and DA is quite profound. However, no annotated resources exist for marking the regions of such switches in the utterance. In this paper, we present a simplified Set of guidelines for detecting code switching in Arabic on the word/token level. We use these guidelines in annotating a corpus that is rich in DA with frequent code switching to MSA. We present both a quantitative and qualitative analysis of the annotations.

pdf bib abs
Annotations for Power Relations on Email Threads
Vinodkumar Prabhakaran | Huzaifa Neralwala | Owen Rambow | Mona Diab
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Social relations like power and influence are difficult concepts to define, but are easily recognizable when expressed. In this paper, we describe a multi-layer annotation scheme for social power relations that are recognizable from online written interactions. We introduce a typology of four types of power relations between dialog participants: hierarchical power, situational power, influence and control of communication. We also present a corpus of Enron emails comprising of 122 threaded conversations, manually annotated with instances of these power relations between participants. Our annotations also capture attempts at exercise of power or influence and whether those attempts were successful or not. In addition, we also capture utterance level annotations for overt display of power. We describe the annotation definitions using two example email threads from our corpus illustrating each type of power relation. We also present detailed instructions given to the annotators and provide various statistics on annotations in the corpus.

pdf bib abs
AWATIF: A Multi-Genre Corpus for Modern Standard Arabic Subjectivity and Sentiment Analysis
Muhammad Abdul-Mageed | Mona Diab
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present AWATIF, a multi-genre corpus of Modern Standard Arabic (MSA) labeled for subjectivity and sentiment analysis (SSA) at the sentence level. The corpus is labeled using both regular as well as crowd sourcing methods under three different conditions with two types of annotation guidelines. We describe the sub-corpora constituting the corpus and provide examples from the various SSA categories. In the process, we present our linguistically-motivated and genre-nuanced annotation guidelines and provide evidence showing their impact on the labeling task.

pdf bib
Predicting Overt Display of Power in Written Dialogs
Vinodkumar Prabhakaran | Owen Rambow | Mona Diab
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Arabic Dialect Processing Tutorial
Mona Diab | Nizar Habash
Tutorial Abstracts at the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Subgroup Detection in Ideological Discussions
Amjad Abu-Jbara | Pradeep Dasigi | Mona Diab | Dragomir Radev
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Modeling Sentences in the Latent Space
Weiwei Guo | Mona Diab
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Genre Independent Subgroup Detection in Online Discussion Threads: A Study of Implicit Attitude using Textual Latent Semantics
Pradeep Dasigi | Weiwei Guo | Mona Diab
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Learning the Latent Semantics of a Concept from its Definition
Weiwei Guo | Mona Diab
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)
Eneko Agirre | Johan Bos | Mona Diab | Suresh Manandhar | Yuval Marton | Deniz Yuret
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
Eneko Agirre | Daniel Cer | Mona Diab | Aitor Gonzalez-Agirre
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
Weiwei: A Simple Unsupervised Latent Semantics based Approach for Sentence Similarity
Weiwei Guo | Mona Diab
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
A Pilot PropBank Annotation for Quranic Arabic
Wajdi Zaghouani | Abdelati Hawwari | Mona Diab
Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature

pdf bib
Building an Arabic Multiword Expressions Repository
Abdelati Hawwari | Kfir Bar | Mona Diab
Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages

pdf bib
SAMAR: A System for Subjectivity and Sentiment Analysis of Arabic Social Media
Muhammad Abdul-Mageed | Sandra Kuebler | Mona Diab
Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis

In this paper, we present the details of creating a pilot Arabic proposition bank (Propbank). Propbanks exist for both English and Chinese. However the morphological and syntactic expression of linguistic phenomena in Arabic yields a very different type of process in creating an Arabic propbank. Hence, we highlight those characteristics of Arabic that make creating a propbank for the language a different challenge compared to the creation of an English Propbank.We believe that many of the lessons learned in dealing with Arabic could generalise to other languages that exhibit equally rich morphology and relatively free word order.

pdf bib
Semantic Role Labeling Systems for Arabic using Kernel Methods
Mona Diab | Alessandro Moschitti | Daniele Pighin
Proceedings of ACL-08: HLT

pdf bib
Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking
Ryan Roth | Owen Rambow | Nizar Habash | Mona Diab | Cynthia Rudin
Proceedings of ACL-08: HLT, Short Papers

pdf bib
Coling 2008: Proceedings of the 3rd Textgraphs workshop on Graph-based Algorithms for Natural Language Processing
Irina Matveeva | Chris Biemann | Monojit Choudhury | Mona Diab
Coling 2008: Proceedings of the 3rd Textgraphs workshop on Graph-based Algorithms for Natural Language Processing

2007

pdf bib
Arabic diacritization in the context of statistical machine translation
Mona Diab | Mahmoud Ghoneim | Nizar Habash
Proceedings of Machine Translation Summit XI: Papers

pdf bib
Semi-automatic error analysis for large-scale statistical machine translation
Katrin Kirchhoff | Owen Rambow | Nizar Habash | Mona Diab
Proceedings of Machine Translation Summit XI: Papers

pdf bib
Arabic Dialect Processing Tutorial
Mona Diab | Nizar Habash
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Tutorial Abstracts

pdf bib
CUNIT: A Semantic Role Labeling System for Modern Standard Arabic
Mona Diab | Alessandro Moschitti | Daniele Pighin
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

pdf bib
Improved Arabic Base Phrase Chunking with a new enriched POS tag set
Mona Diab
Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources

2006

pdf bib
Arabic Dialect Processing
Mona Diab | Nizar Habash
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Tutorials

pdf bib
Parsing Arabic Dialects
David Chiang | Mona Diab | Nizar Habash | Owen Rambow | Safiullah Shareef
11th Conference of the European Chapter of the Association for Computational Linguistics

In this paper, we describe the methodological procedures and issues that emerged from the development of a pilot Levantine Arabic Treebank (LATB) at the Linguistic Data Consortium (LDC) and its use at the Johns Hopkins University (JHU) Center for Language and Speech Processing workshop on Parsing Arabic Dialects (PAD). This pilot, consisting of morphological and syntactic annotation of approximately 26,000 words of Levantine Arabic conversational telephone speech, was developed under severe time constraints; hence the LDC team drew on their experience in treebanking Modern Standard Arabic (MSA) text. The resulting Levantine dialect treebanked corpus was used by the PAD team to develop and evaluate parsers for Levantine dialect texts. The parsers were trained on MSA resources and adapted using dialect-MSA lexical resources (some developed especially for this task) and existing linguistic knowledge about syntactic differences between MSA and dialect. The use of the LATB for development and evaluation of syntactic parsers allowed the PAD team to provide feedbasck to the LDC treebank developers. In this paper, we describe the creation of resources for this corpus, as well as transformations on the corpus to eliminate speech effects and lessen the gap between our pre-existing MSA resources and the new dialectal corpus

pdf bib
Unsupervised Induction of Modern Standard Arabic Verb Classes
Neal Snider | Mona Diab
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers

pdf bib
Unsupervised Induction of Modern Standard Arabic Verb Classes Using Syntactic Frames and LSA
Neal Snider | Mona Diab
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions