Douglas Briesch
2014
Finding Romanized Arabic Dialect in Code-Mixed Tweets
Clare Voss | Stephen Tratz | Jamal Laoudi | Douglas Briesch
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Recent computational work on Arabic dialect identification has focused primarily on building and annotating corpora written in Arabic script. Arabic dialects, however, also appear written in Roman script, especially in social media. This paper describes our recent work developing tweet corpora and a token-level classifier that identifies a Romanized Arabic dialect and distinguishes it from French and English in tweets. We focus on Moroccan Darija, one of several spoken vernaculars in the family of Maghrebi Arabic dialects. Even given noisy, code-mixed tweets, the classifier achieved token-level recall of 93.2% on Romanized Arabic dialect, 83.2% on English, and 90.1% on French. The classifier, now integrated into our tweet conversation annotation tool (Tratz et al., 2013), has semi-automated the construction of a Romanized Arabic-dialect lexicon. Two datasets, a full list of Moroccan Darija surface token forms and a table of lexical entries derived from this list with spelling variants, as extracted from our tweet corpus collection, will be made available in the LRE MAP.
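To make the token-level recall figures above concrete, the following is a minimal sketch of how per-language recall can be computed from gold and predicted token labels. It is not the authors' evaluation code: the function name `per_class_recall` and the label strings `"darija"`, `"en"`, and `"fr"` are assumptions chosen for illustration, and the toy labels are invented.

```python
from collections import Counter


def per_class_recall(gold, predicted):
    """Return {label: recall} where recall = correct tokens / gold tokens of that label.

    `gold` and `predicted` are parallel lists with one language label per token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")

    # Number of tokens that truly belong to each language.
    totals = Counter(gold)
    # Number of those tokens the classifier labeled correctly.
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)

    return {label: correct[label] / totals[label] for label in totals}


if __name__ == "__main__":
    # Hypothetical labels for the tokens of one code-mixed tweet.
    gold = ["darija", "darija", "fr", "en", "en"]
    predicted = ["darija", "fr", "fr", "en", "darija"]
    for label, recall in sorted(per_class_recall(gold, predicted).items()):
        print(f"{label}: {recall:.1%}")
```

Recall is computed per language over individual tokens rather than over whole tweets, which is why a single code-mixed tweet can contribute to all three scores.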
2013
Tweet Conversation Annotation Tool with a Focus on an Arabic Dialect, Moroccan Darija
Stephen Tratz | Douglas Briesch | Jamal Laoudi | Clare Voss
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse
2012
Assessing Divergence Measures for Automated Document Routing in an Adaptive MT System
Claire Jaja | Douglas Briesch | Jamal Laoudi | Clare Voss
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Custom machine translation (MT) engines systematically outperform general-domain MT engines when translating within the relevant custom domain. This paper investigates the use of the Jensen-Shannon divergence measure for automatically routing new documents within a translation system with multiple MT engines to the appropriate custom MT engine in order to obtain the best translation. Three distinct domains are compared, and the impact of the language, size, and preprocessing of the documents on the Jensen-Shannon score is addressed. Six test datasets are then compared to the three known-domain corpora to predict which of the three custom MT engines they would be routed to at runtime given their Jensen-Shannon scores. The results are promising for incorporating this divergence measure into a translation workflow.
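The routing idea in this abstract can be illustrated with a short sketch: compute the Jensen-Shannon divergence between a new document's word distribution and each known-domain corpus, then send the document to the domain with the lowest score. This is a sketch under stated assumptions, not the paper's implementation: whitespace tokenization, lowercasing, unigram distributions, and the names `unigram_distribution`, `js_divergence`, and `route` are all illustrative choices, and the toy corpora are invented.

```python
import math
from collections import Counter


def unigram_distribution(text):
    """Relative frequency of each lowercased, whitespace-separated token (assumed preprocessing)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions given as dicts.

    JSD(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = 0.5 * (P + Q).
    With base-2 logarithms the value lies between 0 and 1.
    """
    divergence = 0.0
    for tok in set(p) | set(q):
        pi = p.get(tok, 0.0)
        qi = q.get(tok, 0.0)
        mi = 0.5 * (pi + qi)
        # Terms with zero probability contribute nothing to the KL sums.
        if pi > 0:
            divergence += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            divergence += 0.5 * qi * math.log2(qi / mi)
    return divergence


def route(document, domain_corpora):
    """Return (best_domain, scores); the best domain has the lowest divergence."""
    doc_dist = unigram_distribution(document)
    scores = {
        name: js_divergence(doc_dist, unigram_distribution(corpus))
        for name, corpus in domain_corpora.items()
    }
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical stand-ins for three known-domain corpora.
    corpora = {
        "medical": "patient dose treatment clinic patient",
        "legal": "court contract statute court ruling",
        "news": "election minister city report",
    }
    best, scores = route("the patient received a dose", corpora)
    print(best, scores)
```

A lower score means the document's vocabulary is closer to that domain's corpus. The measure is symmetric and bounded, which makes scores comparable across domains, though document size and preprocessing affect them, as the abstract notes.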