Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Government MT User Program
As early as June 2003, the United States Army partnered with United States Joint Forces Command to review language requirements within the Army, and, to a lesser extent, the other United States Military Services. After review of missions that require language translation, in 2005 the Army completed an Analysis of Alternatives document, which served as an independent assessment of potential language translation alternatives: options and numerical assessments based on each option’s ability to address language translation requirements. Of the four identified alternatives (printed materials, government off the shelf, commercial off the shelf, and overarching program), incremental development of two-way speech and text translation software modules proved to be the most mission- and cost-effective. That same year, the United States Department of Defense published the Defense Language Transformation Roadmap, listing a requirement for a coherent, prioritized, and coordinated multi-language technology research, development and acquisition policy and program. Since 2005, the Army and the Joint Staff have validated requirements for machine foreign language translation capability. In the effort to develop a comprehensive machine foreign language translation capability, the Army not only needs to enable software to handle one of the most complex systems that humans deal with, but also needs to develop the architecture and processes to routinely produce and maintain this capability. The Army has made the initial effort, funding a machine foreign language translation program known as the Machine Foreign Language Translation System (MFLTS) Program. It is intended to be the overarching Army Program with Department of Defense interest to provide machine foreign language translation capabilities that meet language translation gaps. MFLTS will provide a basic communications and triage capability for speech and text translations and improve those capabilities as the technology advances.
Capabilities are intended to be delivered through three configurations: over established networks (web-based), in mobile (or desktop) configurations, and on portable platforms (or man-wearable microprocessors and/or handhelds). MFLTS software, as a mission enabler ported on other platforms and systems, will provide Joint, Allied/Coalition units and personnel with language translation capability within the full range of military operations. Most recently, the Army convened a Machine Foreign Language Translation System (MFLTS) General Officer Steering Group (GOSG) in March 2012 and validated follow-on language, domain and technology required capabilities for the Army MFLTS Program beyond the initial capability scheduled for 2014.
Government agencies are investing in MT to boost production, but the future funding picture is uncertain. Decision makers (Congress, OMB, IC leadership) want evidence (quantitative/qualitative) of value for investments. Agencies can use positive ROIs to defend MT investment budgets, plans, and programs, but the information needs to be more than anecdotal.
The explosive growth of social media has led to a wide range of new challenges for machine translation and language processing. The language used in social media occupies a new space between structured and unstructured media, formal and informal language, and dialect and standard usage. Yet these new platforms have given a digital voice to millions of users, giving them the opportunity to communicate on the first truly global stage – the Internet. Social media covers a broad category of communications formats, ranging from threaded conversations on Facebook to microblog and short message content on platforms like Twitter and Weibo – but it also includes user-generated comments on YouTube, as well as the contents of the video itself, and even includes ‘traditional’ blogs and forums. The common thread linking all of these is that the media is generated by, and targeted at, individuals. This talk will survey some of the most popular social media platforms, and identify key challenges in translating the content found in them – including dialect, code switching, mixed encodings, the use of “internet speak”, and platform-specific language phenomena, as well as volume and genre. In addition, we will talk about some of the challenges in analyzing social media from an operational point of view, and how language and translation issues influence higher-level analytic processes such as entity extraction, topic classification and clustering, geo-spatial analysis and other technologies that enable comprehension of social media. These latter capabilities are being adapted for social media analytics for US Government analysts under the support of the Technical Support Working Group at the US DoD, enabling translingual comprehension of this style of content in an operational environment.
Developers producing language technology for under-resourced languages often find relatively little machine-readable text for the data required to train machine translation systems. Typically, the kinds of text that are most accessible for production of parallel data are news and news-related genres, yet the language that requires translation for analysts and decision-makers reflects a broad range of forms and contents. The proposed paper will describe an effort funded by the ODNI FLPO in which the Army Research Laboratory, assisted by MITRE language technology researchers, produced a Dari-English parallel corpus containing text in a variety of styles and genres that more closely resemble the kinds of documents needed by government users than do traditional news genres. The data production effort began with a survey of Dari documents catalogued in a government repository of material obtained from the field in Afghanistan. Because the documents in the repository are not available for creation of parallel corpora, the goal was to quantify the types of documents in the collection and identify their linguistic features in order to find documents that are similar. Document images were obtained from two sources: (1) the Preserving and Creating Access to Unique Afghan Records collection, an online resource produced by the University of Arizona Libraries and the Afghanistan Centre at Kabul University, and (2) the University of Nebraska Arthur Paul Afghanistan Collection. For the latter, document images were obtained by camera capture of books and by selecting PDF images of microfiche records. A set of 1395 document page images was selected to provide 250,000 translated English words in 10 content domains. The images were transcribed and translated according to specifications designed to maximize the quality and usefulness of the data.
The corpus will be used to create a Dari-English glossary, and an experiment will quantify improvements to Dari-English translation of multi-genre text when a generic Dari-English machine translation system is customized using the corpus. The proposed paper will present highlights from these efforts.
The government and the research community have strived for the past few decades to develop machine translation capabilities. Historically, DARPA took the lead in the grand challenge aiming at surpassing human translation quality. While we have made strides from rule-based to statistical and hybrid machine translation engines, we cannot rely solely on machine translation to overcome the language barrier and accomplish the mission. Machine translation is often misunderstood or misplaced in operational settings, where expectations are unrealistic and optimization is not achieved. With the increase in volume, variety and velocity of data, new paradigms are needed when choosing machine translation software and embedding it into a business process so as to achieve the operational goals. The talk will focus on the operational requirements and frame where, when and how to use machine translation. We will also outline some gaps and suggest new areas for research, development, and implementation.
Making the right connections hinges on linking data from disparate sources. Frequently the link may be a person or place, so something as simple as a mistranslated name will cause a search to miss relevant documents. To swiftly and accurately exploit a growing flood of foreign language information acquired for the defense of the nation, Intelligence Community (IC) linguists and analysts need assistance in both translation accuracy and productivity. The name translation and standardizing component of a Computer-Aided Translation (CAT) tool such as the Highlight language analysis suite ensures fast and reliable translation of names from Arabic, Dari, Farsi, and Pashto according to a number of government transliteration standards. Highlight improves efficiency and maximizes the utilization of scarce human resources.
This paper proposes some strategies and techniques for creating phrase-level user parallel corpora for the Systran translation engine. Though not all strategies and techniques discussed here will apply to other translation engines, the concept will.
The purpose of this presentation is to discuss recent efforts within the government to address issues of evaluation and return on investment. Pressure to demonstrate value has increased with the growing amount of foreign language information available, with the variety of languages needing to be exploited, and with the increasing gaps between numbers of language-enabled people and the amount of work to be done. This pressure is only growing as budgets shrink, and as global development grows. Over the past year, the ODNI has led an effort to pull together different government stakeholders to determine some baseline standards for determining Return on Investment via task-based evaluation. Stakeholder consensus on major HLT tasks has involved examination of the different approaches to determining return on investment and how it relates to the use of HLT in the workflow. In addition to reporting on the goals and progress of this group, we will present future directions and invite community input.
It is common knowledge that translation is an ambiguous, 1-to-n mapping process, but to date, our community has produced no empirical estimates of this ambiguity. We have developed an annotation tool that enables us to create representations that compactly encode an exponential number of correct translations for a sentence. Our findings show that naturally occurring sentences have billions of translations. Having access to such large sets of meaning-equivalent translations enables us to develop a new metric, HyTER, for translation accuracy. We show that our metric provides better estimates of machine and human translation accuracy than alternative evaluation metrics using data from the most recent Open MT NIST evaluation and we discuss how HyTER representations can be used to inform a data-driven inquiry into natural language semantics.
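To make the idea of a compact encoding concrete, the following is a toy sketch only, not the annotation tool or the HyTER implementation. It assumes a sentence can be represented as a sequence of slots with interchangeable phrases (the `SLOTS` data and all function names are hypothetical), and it assumes an edit-distance-style comparison against the closest encoded translation. The number of encoded translations is the product of the slot sizes, which is why a small representation can stand for a very large set.

```python
from itertools import product
from math import prod

# Hypothetical toy network: each slot lists interchangeable phrases.
SLOTS = [
    ["the meeting", "the session"],
    ["was postponed", "was delayed", "got pushed back"],
    ["until monday", "to monday"],
]


def count_translations(slots):
    """Number of distinct paths: the product of the alternatives per slot."""
    return prod(len(alternatives) for alternatives in slots)


def expand(slots):
    """Enumerate every encoded translation as a token list (toy sizes only)."""
    for choice in product(*slots):
        yield " ".join(choice).split()


def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def closest_distance(hypothesis, slots):
    """Fewest word edits needed to turn the hypothesis into any encoded translation."""
    tokens = hypothesis.split()
    return min(edit_distance(tokens, reference) for reference in expand(slots))
```

Exhaustive enumeration is used here only because the example is tiny; with billions of encoded translations, a search over the network itself would be needed.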
Over the years, the government has translated reams of material, transcribed decades of audio, and processed years of text. Where is that material now? How valuable would it be to have that material available to push research and applications and to support foreign language training? Over 20 years ago, DARPA funded the Linguistic Data Consortium (LDC) at the University of Pennsylvania to collect, catalog, store and provide access to language resources. Since that time, the LDC has collected thousands of corpora in many different genres and languages. Although the government has access to the full range of LDC data through a community license, until recently corpora specific to government needs were usually deleted soon after they were created. In order to address the need for a government-only catalog and repository, the Government Catalog of Language Resources (GCLR) was funded through the ODNI, and an initial prototype has been built. The GCLR will be transferred to a government executive agent who will be responsible for making improvements, adding corpora, and maintaining and sustaining the effort. The purpose of this talk is to present the model behind the GCLR, to demonstrate its purpose, and to invite attendees to contribute to and use its contents. Background leading up to the current version will be presented. Use cases of parallel corpora in teaching, technology development and language maintenance will also be covered. Learning from the LDC how corpora are used, and linking with the LDC, will be part of future directions to enable government applications to utilize these resources.
Translation memory (TM) software allows a user to leverage previously translated material in the form of parallel corpora to improve the quality, efficiency, and consistency of future translation work. Within the intelligence community (IC), one of the major bottlenecks in implementing TM systems is developing a relevant parallel corpus. In particular, the IC needs to explore methods of deploying open source corpora for use with TM systems in a classified setting. To address this issue we are devising automated metrics for comparing various corpora in order to predict their usefulness to serve as vaults for particular translation needs. The proposed methodology will guide the use of these corpora, as well as the selection and optimization of novel corpora. One of the critical factors in TM vault creation is optimizing the trade-off between vault size and domain-specificity. Although a larger corpus may be more likely to contain material that matches words or phrases in the material to be translated, there is a danger that some of the proposed matches may include translations that are inappropriate for a given context. If the material in the vault and the material to be translated cover similar domains, the matches provided by the vault may be more likely to occur in the appropriate context. To explore this trade-off we are developing and implementing computational similarity metrics (e.g., n-gram overlap, TF-IDF) for comparison of corpora covering 12 different domains. We are also examining summary statistics produced by TM systems to test the degree to which material from each domain serves as a useful vault for translating material from each of the other domains, as well as the degree to which vault size improves the number and quality of proposed matches. 
The results of this research will help translation managers and other users assess the utility of a given parallel corpus for their particular translation needs, and may ultimately lead to improved tagging within TM systems to help translators identify the most relevant matches. Use of open source materials allows tool developers and users to leverage existing corpora, thus holding the promise of driving down costs of vault creation and selection. Optimizing vaults also promises to improve the quality, efficiency, and consistency of translation processes and products.
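As an illustration of the two similarity metrics named above, the following minimal sketch compares tokenized corpora by n-gram overlap and by TF-IDF cosine similarity. It is not the project's implementation: the function names are hypothetical, n-gram overlap is assumed to be the Jaccard overlap of n-gram sets, each corpus is treated as a single document, and a smoothed IDF is assumed.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(tokens_a, tokens_b, n=2):
    """Jaccard overlap between the n-gram sets of two corpora."""
    a, b = set(ngrams(tokens_a, n)), set(ngrams(tokens_b, n))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def tfidf_vectors(corpora):
    """Map each corpus name to a sparse TF-IDF vector (dict of term -> weight).

    `corpora` maps a name to a token list; each corpus counts as one document.
    """
    n_docs = len(corpora)
    doc_freq = Counter()
    for tokens in corpora.values():
        doc_freq.update(set(tokens))
    vectors = {}
    for name, tokens in corpora.items():
        counts = Counter(tokens)
        vectors[name] = {
            # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy corpora standing in for domain-specific vaults.
corpora = {
    "medical_a": "the patient received the vaccine dose".split(),
    "medical_b": "the patient needs a vaccine dose".split(),
    "finance": "the bank raised the interest rate".split(),
}
vectors = tfidf_vectors(corpora)
```

On this toy data, the two medical corpora score higher on both measures than a medical corpus paired with the finance corpus, which is the kind of signal a vault-selection procedure could use.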
Online communications are playing an unprecedented role in propelling the revolutionary changes that are sweeping throughout the Middle East. A significant portion of that communication is in Romanized Arabic chat (Arabizi), which uses a combination of numerals and Roman characters, as well as non-Arabic words, to write Arabic in place of conventional Arabic script. Language purists in the Arabic-speaking world are lamenting that the use of Arabizi is becoming so pervasive that it is “destroying the Arabic language.” Despite its widespread use and significant effect on emerging societies, government agencies and others have been unable to extract any useful data from Arabizi because of its unconventional characteristics. Therefore, they have had to rely on human, computer-savvy translators, who often are a burden on dwindling resources, and are easily overwhelmed by the sheer volume of incoming data. Our presentation will explore the challenges of triaging and analyzing the Romanized Arabic format and describe Basis Technology’s Arabic chat translation software. This system will convert, for instance, mo2amrat, mo2amaraat, or mou’amret to مؤامرات. The output of standard Arabic can then be exploited for relevant information with a full set of other tools that will index/search, carry out linguistic analyses, extract entities, translate/transliterate names, and machine translate from the Arabic into English or other languages. Because of the nature of Arabizi – writers are able to express themselves in their native Arabic dialects, something that is not so easily done with Modern Standard Arabic – there is a bonus feature in that now we are also able to identify the probable geographical origins of each writer, something that is of great intelligence value.
Looking at real-world scenarios, we will discuss how the chat translator can be built into solutions for users to overcome technological, linguistic, and cultural obstacles to achieve operational success and complete tasks.
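The spelling variation in the example above (mo2amrat, mo2amaraat, mou’amret) can be illustrated with a toy sketch. This is not how the Basis Technology software works; it only assumes that variant spellings can be collapsed to a shared consonant skeleton and looked up in a lexicon. The function names and the one-entry `LEXICON` are hypothetical.

```python
VOWELS = set("aeiou")
# Characters assumed to stand for hamza in Arabizi: the numeral 2 and apostrophes.
HAMZA_MARKS = {"2", "'", "\u2019"}

# Hypothetical one-entry lexicon keyed by consonant skeleton.
LEXICON = {"m2mrt": "مؤامرات"}


def skeleton(word):
    """Drop Latin vowels and unify hamza marks so spelling variants collapse."""
    out = []
    for ch in word.lower():
        if ch in HAMZA_MARKS:
            out.append("2")
        elif ch not in VOWELS:
            out.append(ch)
    return "".join(out)


def to_arabic(word):
    """Return the Arabic-script form for a known skeleton, else None."""
    return LEXICON.get(skeleton(word))
```

All three spellings from the example reduce to the same key, `m2mrt`. A real system would also have to handle dialect, ambiguity between words that share a skeleton, and embedded non-Arabic words.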
We present the Reverse Palladius (RevP) program developed by the Air Force Research Laboratory's Speech and Communication Research, Engineering, Analysis, and Modeling (SCREAM) Laboratory for the National Air and Space Intelligence Center (NASIC). The RevP program assists the linguist in correcting the transliteration of Mandarin Chinese names during the Russian to English translation process. Chinese names cause problems for transliteration, because Russian writers follow a specific Palladius mapping for Chinese sounds. Typical machine translation of Russian into English then applies standard transliteration of the Russian sounds in these names, producing errors that require hand-correction. For example, the Chinese name Zhai Zhigang is written in Cyrillic as Чжай Чжиган, and standard transliteration via Systran renders this into English as Chzhay Chzhigan. In contrast, the RevP program uses rules that reverse the Palladius mapping, yielding the correct form Zhai Zhigang. When using the RevP program, the linguist opens a Russian document and selects a Chinese name for transliteration. The rule-based algorithm proposes a reverse Palladius transliteration, as well as a stemmed option if the word terminates in a possible Russian inflection. The linguist confirms the appropriate version of the name, and the program both corrects the current instance and stores the information for future use. The resulting list of name mappings can be used to pre-translate names in new documents, either via stand-alone operation of the RevP program, or through compilation of the list as a Systran user dictionary. The RevP program saves time by removing the need for post-editing of Chinese names, and improves consistency in the translation of these names. The user dictionary becomes more useful over time, further reducing the time required for translation of new documents.
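The workflow described above can be sketched in a few lines. This is a minimal illustration, not the RevP program itself: the syllable table is a small subset chosen to cover the example name, greedy longest-match segmentation is an assumption, and the list of Russian endings and all function names are hypothetical.

```python
# Illustrative subset of a Palladius syllable table (Cyrillic -> pinyin).
# A full table would cover every Mandarin syllable.
PALLADIUS_TO_PINYIN = {
    "чжай": "zhai",
    "чжи": "zhi",
    "ган": "gang",  # Palladius writes final -ng as -н
    "гань": "gan",  # and final -n as -нь
}
MAX_SYLLABLE_LEN = max(len(key) for key in PALLADIUS_TO_PINYIN)

# Hypothetical list of Russian case endings to try stripping.
RUSSIAN_ENDINGS = ("ом", "а", "у", "е")


def reverse_palladius(word):
    """Return the pinyin form of a Cyrillic name, or None if it does not parse."""
    text = word.lower()
    syllables = []
    pos = 0
    while pos < len(text):
        # Greedy longest match against the syllable table.
        for length in range(min(MAX_SYLLABLE_LEN, len(text) - pos), 0, -1):
            pinyin = PALLADIUS_TO_PINYIN.get(text[pos:pos + length])
            if pinyin is not None:
                syllables.append(pinyin)
                pos += length
                break
        else:
            return None  # no syllable matches at this position
    return "".join(syllables).capitalize()


def propose(word):
    """List candidates: the full word first, then forms with an ending stripped."""
    candidates = []
    full = reverse_palladius(word)
    if full:
        candidates.append(full)
    for ending in RUSSIAN_ENDINGS:
        if word.lower().endswith(ending) and len(word) > len(ending):
            stemmed = reverse_palladius(word[:-len(ending)])
            if stemmed and stemmed not in candidates:
                candidates.append(stemmed)
    return candidates


# Confirmed names are stored so they can pre-translate names in new documents.
name_mappings = {}
for token in "Чжай Чжигана".split():
    options = propose(token)
    if options:
        name_mappings[token] = options[0]  # stand-in for the linguist's confirmation
```

Here the inflected form Чжигана does not parse as a whole, so only the stemmed candidate is proposed. In the interactive tool the linguist, rather than the first candidate, decides which version is stored.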