Behrang Mohit - ACL Anthology

2016

Building an Arabic Machine Translation Post-Edited Corpus: Guidelines and Annotation
Wajdi Zaghouani | Nizar Habash | Ossama Obeid | Behrang Mohit | Houda Bouamor | Kemal Oflazer
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present our guidelines and annotation procedure for creating a human-corrected, post-edited corpus of machine translations in Modern Standard Arabic. Our overarching goal is to use the annotated corpus to develop automatic machine translation post-editing systems for Arabic that can help accelerate the human revision of translated texts. The creation of any manually annotated corpus usually presents many challenges. To address these challenges, we created comprehensive and simplified annotation guidelines, which were used by a team of five annotators and one lead annotator. To ensure high agreement between the annotators, multiple training sessions were held and inter-annotator agreement was measured regularly to check the annotation quality. The resulting corpus of manually post-edited English-to-Arabic translations of articles is the largest to date for this language pair.

2015

The Second QALB Shared Task on Automatic Text Correction for Arabic
Alla Rozovskaya | Houda Bouamor | Nizar Habash | Wajdi Zaghouani | Ossama Obeid | Behrang Mohit
Proceedings of the Second Workshop on Arabic Natural Language Processing

Correction Annotation for Non-Native Arabic Texts: Guidelines and Corpus
Wajdi Zaghouani | Nizar Habash | Houda Bouamor | Alla Rozovskaya | Behrang Mohit | Abeer Heider | Kemal Oflazer
Proceedings of the 9th Linguistic Annotation Workshop

2014

Large Scale Arabic Error Annotation: Guidelines and Framework
Wajdi Zaghouani | Behrang Mohit | Nizar Habash | Ossama Obeid | Nadi Tomeh | Alla Rozovskaya | Noura Farra | Sarah Alkuhlani | Kemal Oflazer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present annotation guidelines and a web-based annotation framework developed as part of an effort to create a manually annotated Arabic corpus of errors and corrections for various text types. Such a corpus will be invaluable for developing Arabic error correction tools, both for training models and as a gold standard for evaluating error correction algorithms. We summarize the guidelines we created. We also describe issues encountered during the training of the annotators, as well as problems that are specific to the Arabic language that arose during the annotation process. Finally, we present the annotation tool that was developed as part of this project, the annotation pipeline, and the quality of the resulting annotations.

YouDACC: the Youtube Dialectal Arabic Comment Corpus
Ahmed Salama | Houda Bouamor | Behrang Mohit | Kemal Oflazer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents YOUDACC, an automatically annotated large-scale multi-dialectal Arabic corpus collected from user comments on YouTube videos. Our corpus covers different groups of dialects: Egyptian (EG), Gulf (GU), Iraqi (IQ), Maghrebi (MG) and Levantine (LV). We perform an empirical analysis on the crawled corpus and demonstrate that our proposed location-based method is effective for the task of dialect labeling.

The First QALB Shared Task on Automatic Text Correction for Arabic
Behrang Mohit | Alla Rozovskaya | Nizar Habash | Wajdi Zaghouani | Ossama Obeid
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

CMUQ-Hybrid: Sentiment Classification By Feature Engineering and Parameter Tuning
Kamla Al-Mannai | Hanan Alshikhabobakr | Sabih Bin Wasi | Rukhsar Neyaz | Houda Bouamor | Behrang Mohit
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

A Human Judgement Corpus and a Metric for Arabic MT Evaluation
Houda Bouamor | Hanan Alshikhabobakr | Behrang Mohit | Kemal Oflazer
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

CMUQ@Qatar:Using Rich Lexical Features for Sentiment Analysis on Twitter
Sabih Bin Wasi | Rukhsar Neyaz | Houda Bouamor | Behrang Mohit
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

2013

A Web-based Annotation Framework For Large-Scale Text Correction
Ossama Obeid | Wajdi Zaghouani | Behrang Mohit | Nizar Habash | Kemal Oflazer | Nadi Tomeh
The Companion Volume of the Proceedings of IJCNLP 2013: System Demonstrations

SuMT: A Framework of Summarization and MT
Houda Bouamor | Behrang Mohit | Kemal Oflazer
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Dudley North visits North London: Learning When to Transliterate to Arabic
Mahmoud Azab | Houda Bouamor | Behrang Mohit | Kemal Oflazer
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Supersense Tagging for Arabic: the MT-in-the-Middle Attack
Nathan Schneider | Behrang Mohit | Chris Dyer | Kemal Oflazer | Noah A. Smith
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2012

Transforming Standard Arabic to Colloquial Arabic
Emad Mohamed | Behrang Mohit | Kemal Oflazer
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Annotating and Learning Morphological Segmentation of Egyptian Colloquial Arabic
Emad Mohamed | Behrang Mohit | Kemal Oflazer
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present an annotation and morphological segmentation scheme for Egyptian Colloquial Arabic (ECA) in which we annotate user-generated content that deviates significantly from the orthographic and grammatical rules of Modern Standard Arabic (MSA) and thus cannot be processed by the commonly used MSA tools. A per-letter classification scheme, in which each letter is classified as either a segment boundary or not, combined with a memory-based classifier that uses only word-internal context, proves effective and achieves 92% exact-match accuracy at the word level. The well-known MADA system achieves 81%, while the per-letter classification scheme using the ATB achieves 82%. Error analysis shows that the major problem is character ambiguity, since ECA orthography overloads characters that would otherwise be more specific in MSA: the distinctions between y (ي) and Y (ى), and among A (ا), > (أ), and < (إ), are collapsed to y (ي) and A (ا) respectively, or the characters are even totally confused and used interchangeably. While normalization helps alleviate orthographic inconsistencies, it aggravates the problem of ambiguity.

Recall-Oriented Learning of Named Entities in Arabic Wikipedia
Behrang Mohit | Nathan Schneider | Rishav Bhowmick | Kemal Oflazer | Noah A. Smith
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study
Nathan Schneider | Behrang Mohit | Kemal Oflazer | Noah A. Smith
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2010

Improving Phrase-Based Translation with Prototypes of Short Phrases
Frank Liberato | Behrang Mohit | Rebecca Hwa
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Using Variable Decoding Weight for Language Model in Statistical Machine Translation
Behrang Mohit | Rebecca Hwa | Alon Lavie
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

This paper investigates varying the decoder weight of the language model (LM) when translating different parts of a sentence. We determine the condition under which the LM weight should be adapted. We find that a better translation can be achieved by varying the LM weight when decoding the most problematic spot in a sentence, which we refer to as a difficult segment. Two adaptation strategies are proposed and compared through experiments. We find that adapting a different LM weight for every difficult segment resulted in the largest improvement in translation quality.

2009

Language Model Adaptation for Difficult to Translate Phrases
Behrang Mohit | Frank Liberato | Rebecca Hwa
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

2007

Localization of Difficult-to-Translate Phrases
Behrang Mohit | Rebecca Hwa
Proceedings of the Second Workshop on Statistical Machine Translation

2005

Syntax-based Semi-Supervised Named Entity Tagging
Behrang Mohit | Rebecca Hwa
Proceedings of the ACL Interactive Poster and Demonstration Sessions

2003

Semantic Extraction with Wide-Coverage Lexical Resources
Behrang Mohit | Srini Narayanan
Companion Volume of the Proceedings of HLT-NAACL 2003 - Short Papers
