Proceedings of the Second International Conference on Machine Translation: Ten years on
The International Conference, Machine Translation - Ten Years On, took place at Cranfield University, 12-14 November 1994. The occasion was the tenth anniversary of the previous international conference on Machine Translation (MT) held at Cranfield. The 1994 conference was organised by Cranfield University in conjunction with the Natural Language Translation Specialist Group of the British Computer Society. Apart from detailed descriptions of prototype systems, the conference provided overviews of general developments in the field of MT. Considerable research is taking place into speech recognition and dialogue systems, and into incorporating features of spoken language and discourse into computer representations of natural language. At the same time, more sophisticated techniques for the statistical analysis of text corpora are emerging that may fundamentally alter the direction of MT research. It is clear that knowledge-based systems representing conceptual information for particular subject domains independently of specific languages are seen as a practical way forward for MT. Another promising direction is the emergence of interactive systems that can be used by non-translators working within a distributed processing environment. Moving away from research and development, the conference afforded practical insights into a number of operational systems. These ranged from large, established systems such as SYSTRAN, to smaller interactive programs for a PC. The evaluation and commercial performance of MT systems remains a key issue, alongside the wider question of who actually uses MT.
The paper examines briefly the impact of the “statistical turn” in machine translation (MT) R&D in the last decade, and particularly the way in which it has made large-scale language resources (lexicons, text corpora etc.) more important than ever before and reinforced the role of evaluation in the development of the field. But resources mean, almost by definition, co-operation between groups and, in the case of MT, specifically co-operation between language groups and states. The paper then considers what alternatives there are now for MT R&D. One is to continue with interlingual methods of translation, even though those are not normally thought of as close to statistical methods. The reason is that statistical methods, taken alone, have almost certainly reached a ceiling in terms of the proportion of sentences and linguistic phenomena they can translate successfully. Interlingual methods remain popular within large electronics companies in Japan, and in a large US Government-funded project (PANGLOSS). The question then discussed is what role there can be for interlinguas and interlingual methods in co-operation in MT across linguistic and national boundaries. The paper then turns to evaluation and asks whether, across national and continental boundaries, it can become a co-operative or a “hegemonic” enterprise. Finally the paper turns to resources themselves and asks why co-operation on resources is proving so hard, even though there are bright spots of real co-operation.
Parallel corpora such as the Canadian Hansard corpus and the International Telecommunications Union (ITU) corpus each provide the same text in two or more languages, and have been aptly described as the "Rosetta Stone" of modern corpus linguistics. Their use within MT is burgeoning, permeating all levels of the discipline, and they are even being used as the basis of full-blown statistically based MT systems. This paper concerns itself with the task of automatic bilingual lexicon construction, which is one of the major goals of the CRATER project (“Corpus Resources and Terminology Extraction”, funded under the MLAP initiative of the CEC, grant number MLAP-93/20). The approach to bilingual lexicon alignment taken here entails the alignment of corpora, followed by a detailed search through the corpus for lexical cognates. Consequently the paper will begin with a brief discussion of the alignment procedures used on the project to date, and then move to a discussion of various similarity metrics used to evaluate lexical similarity.
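The abstract does not name the similarity metrics used; as one illustration of the kind of measure commonly applied to lexical cognate detection, the following sketch computes a Dice coefficient over character bigrams. The function names and the example words are illustrative only, not taken from CRATER.

```python
def bigrams(word):
    """Set of character bigrams of a word, e.g. 'term' -> {'te', 'er', 'rm'}."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(w1, w2):
    """Dice coefficient over character bigrams: 2|A ∩ B| / (|A| + |B|).

    Scores near 1.0 suggest the two words may be cognates; near 0.0, not.
    """
    a, b = bigrams(w1.lower()), bigrams(w2.lower())
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))
```

In a cognate search, such a score would typically be computed for word pairs drawn from aligned sentence pairs, keeping only pairs above some empirically chosen threshold.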
The Computational Dictionary described in this paper is built on a knowledge base. The semantic features of each word, in the relevant grammatical category, can be determined through a hierarchical tree structure. The semantic knowledge of verbs is represented using predicate-calculus definitions. This allows each expression, e.g. a sentence or command, to be tested to determine whether it is meaningful and, if meaningful, what its meaning is, or indeed whether it is ambiguous.
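The mechanism described, a feature hierarchy plus predicate-calculus verb definitions used to test whether an expression is meaningful, can be sketched as follows. The hierarchy, the verb frame, and all feature names here are invented for illustration and are not the paper's actual dictionary.

```python
# Hypothetical feature hierarchy: each feature points to its parent (None = root).
HIERARCHY = {
    "entity": None,
    "animate": "entity",
    "human": "animate",
    "substance": "entity",
    "liquid": "substance",
    "artifact": "entity",
}

def isa(feature, ancestor):
    """True if `feature` equals `ancestor` or inherits from it in the tree."""
    while feature is not None:
        if feature == ancestor:
            return True
        feature = HIERARCHY.get(feature)
    return False

# A verb's predicate-calculus definition reduced to typed argument slots.
DRINK = {"agent": "animate", "patient": "liquid"}

def meaningful(verb_frame, args):
    """An expression is meaningful iff every argument satisfies its slot type."""
    return all(isa(args[slot], typ) for slot, typ in verb_frame.items())
```

Ambiguity detection fits the same scheme: if more than one reading of an expression passes the test, the expression is ambiguous.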
Progress in Machine Translation (MT) during the last ten years has been observed at different levels, but discourse has yet to make a breakthrough. MT research and development has so far concentrated mostly on sentence translation (discourse analysis being a very complicated task), and the successful operation of most working MT systems does not usually go beyond the sentence level. To start with, the paper will review MT research and development over the last ten years at the IAI in Saarbrücken. Next, the MT discourse issues will be discussed both from the point of view of source language analysis and target text generation, and on the basis of the preliminary results of an ongoing "discourse-oriented MT" project. Probably the most important aspect in successfully analysing multisentential source texts is the capacity to establish the anaphoric references to preceding discourse entities. The paper will discuss the problem of anaphora resolution from the perspective of MT. A new integrated model for anaphora resolution, developed for the needs of MT, will also be outlined. As already mentioned, most machine translation systems perform translation sentence by sentence. But even in the case of paragraph translation, the discourse structure of the target text tends to be identical to that of the source text. However, sublanguage discourse structures may differ across languages, and thus a translated text which assumes the same discourse structure as the source text may sound unnatural and perhaps disguise the true intent of the writer. Finally, the paper will outline a new approach for generating discourse structures appropriate to the target sublanguage, and will discuss some of the complicated problems encountered.
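To make the anaphora problem concrete, the sketch below resolves a pronoun by the simplest constraints an integrated model would combine with richer syntactic and domain knowledge: recency plus gender and number agreement. The data structures and example entities are invented for illustration and do not reflect the paper's actual model.

```python
def resolve(pronoun, candidates):
    """Pick the most recent preceding entity agreeing in gender and number.

    `candidates` is ordered oldest-first; each is (name, gender, number).
    `pronoun` is (gender, number). Returns the antecedent name, or None
    if no preceding entity agrees (i.e. resolution fails).
    """
    gender, number = pronoun
    for name, g, n in reversed(candidates):
        if g == gender and n == number:
            return name
    return None
```

This matters directly for MT: a French feminine pronoun such as "elle" may refer to an inanimate noun ("la pompe"), so the chosen antecedent, not the pronoun's surface gender, must drive the target-language rendering.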
The current state of the machine translation system STYLUS is described. The system can produce smooth and accurate translation for more than 80% of source text in the chosen domain. The modular structure of the dictionaries makes it possible to customise the system to personal needs. The grammar employed is based on an ATN-like formalism.
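For readers unfamiliar with ATN-like formalisms, the sketch below shows only the transition-network skeleton: states, arcs labelled with word categories, and a recogniser that follows them (a full ATN adds registers and augmented tests on arcs). The network and lexicon are toy examples, not STYLUS's actual grammar.

```python
# Toy lexicon and network: arcs map (state, category) to the next state.
LEXICON = {"the": "DET", "pump": "N", "runs": "V"}
NETWORK = {
    ("S", "DET"): "S1",   # S --DET--> S1
    ("S1", "N"): "S2",    # S1 --N--> S2
    ("S2", "V"): "FINAL", # S2 --V--> FINAL (accepting state)
}

def recognise(sentence):
    """Accept a sentence iff its category sequence leads from S to FINAL."""
    state = "S"
    for word in sentence.split():
        cat = LEXICON.get(word.lower())
        state = NETWORK.get((state, cat))
        if state is None:
            return False
    return state == "FINAL"
```

In a real ATN grammar, arcs would also set registers (subject, number, tense), which is what makes the formalism attractive for translation rather than mere recognition.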
Most translations are needed for technical documents in specific domains, and the domain knowledge available to the translator is often crucial for the efficiency and quality of the translation task. Our project aims to investigate a MAT paradigm in which the human user is supported by linguistic as well as subject information ([vHa90], [vHAn92]). The basic hypotheses of the approach are:
- domain knowledge is not encoded in the lexicon entries, i.e. we clearly distinguish between the language layer and the conceptual layer;
- the representation of domain knowledge is language-independent and replaces most of the semantic entries in a traditional semantic lexicon of MT/MAT systems;
- the user accesses domain information by highlighting a sequence in the source text and specifying the type of query;
- factual explanations to the user should be simple and transparent, although the underlying formalisms for knowledge representation and processing might be very complex;
- conceptual graphs (CGs), after Sowa [Sow84], were chosen as the language for knowledge representation.
In providing connections between the terms (lexical entries) and the knowledge base, our approach will be compared to terminological knowledge bases (TKBs), which are hybrids of concept-oriented term banks and knowledge bases. This paper presents:
- a contrastive view of knowledge-based techniques in MAT;
- mechanisms for mapping the "ordinary" linguistic lexicon and the terminological lexicon of two languages onto one knowledge base;
- methods to access the domain knowledge in a flexible way without allowing completely free linguistic dialogues;
- techniques for presenting the result of queries to the translator in restricted natural language; and
- the use of domain knowledge to solve specific translation difficulties.
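The highlight-and-query interaction described above can be sketched minimally: a language-independent domain model stored as conceptual-graph-style relation triples, queried by the concept a highlighted term maps to. The graph contents, relation names, and function names are all invented for illustration; real conceptual graphs, and the project's formalism, are considerably richer.

```python
# Hypothetical fragment of a language-independent domain model, encoded as
# (concept, relation, concept) triples in the spirit of conceptual graphs.
GRAPH = {
    ("pump", "part_of", "cooling_circuit"),
    ("pump", "attr", "pressure_rating"),
    ("valve", "part_of", "cooling_circuit"),
}

# Term-to-concept mapping: lexicons of both languages point into ONE model.
TERMS = {"pompe": "pump", "pump": "pump", "vanne": "valve", "valve": "valve"}

def query(term):
    """Relations mentioning the concept behind a highlighted term.

    This is the knowledge a translator would see, rendered in restricted
    natural language by a presentation component (not sketched here).
    """
    concept = TERMS.get(term.lower())
    return sorted(r for r in GRAPH if concept in (r[0], r[2]))
```

Note how the French term "pompe" and the English term "pump" retrieve the same facts: the conceptual layer is shared, only the lexical mapping differs.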
This paper briefly describes the overall architecture of a machine translation system between French and Arabic in the sub-world of cooking recipes. It then describes in more detail the design of the generation component and how this design allows a variety of outputs all expressing the same conceptual meaning. The system belongs to the family of knowledge-based interlingua translation systems, as it emphasises the importance of the meaning of the text being processed and articulates all its available knowledge bases in order to achieve one major goal: flexible, meaningful wording. We agree with S. Nirenburg that "the ability and the right to subdivide sentences or to combine them together in the Target language are powerful tools in the hands of human translators." These are some of the tools that we want our MT systems to be able to use. The way the system is modularised allowed us to experiment with the generation of: sentence-to-sentence translations, text-to-text translations, more concise as opposed to more generalised wording, and varying word orders. The modules are declarative and loosely coupled. This strategy allowed us to experiment with regenerating the French text. Indeed, the generation component of this MT system is multilingual and capable of accommodating Arabic and French, two languages from different families, namely Semitic and Indo-European. The system, being functional in the domain of cooking recipes, allowed us to concentrate on the lexical semantics of its vocabulary and on the modularisation of its linguistic knowledge, whether morphological, syntactic or stylistic, as opposed to its pragmatic knowledge. Now that we have tested the design on different languages, we are studying its feasibility in new domains where texts mainly consist of verbal phrases, such as gardening and chemistry laboratory manuals.
We argue that, in many situations, Dialogue-Based MT is likely to offer better solutions to translation needs than machine aids to translators or batch MT, even if controlled languages are used. Objections to DBMT have led us to introduce the new concept of “self-explaining document”, which might be used in monolingual as well as in multilingual contexts, and deeply change our way of understanding important or difficult written material.
A significant factor in air accidents is "pilot error". Included in this category are errors in natural language communication between the pilot and air traffic control (ATC); errors possibly compounded by the use of English as a standard language for such communication. We concentrate on the likelihood of misunderstanding created by ambiguities in these messages. Often only a few seconds exist between the receipt of an ambiguous message and the subsequent incorrect action (potentially) leading to a fatal accident. We consider the feasibility of filtering each spoken message through an "intelligent computer interface", testing for ambiguities and transmitting only those messages which are clear and unambiguous. Unclear, ambiguous messages should be "authenticated" before transmission. The procedures for computer analysis would require not only sensitive speech recognition equipment but also complex software performing sophisticated linguistic analysis at the phonetic, syntactic, semantic and pragmatic levels. Analysis must also take place in "real time", so that both pilot and controller can receive warning that ambiguities exist in the last communication and corrective action can be taken in the short time available. Consideration is also given to extending the system from the monolingual to the multilingual level, allowing pilot and controller each to think and speak in their own native tongue. The sophisticated language analysis is being extended to allow for appropriate disambiguated, bilingual machine translation.
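The filtering idea can be illustrated at its crudest level: a surface lookup against a table of phrases with more than one recorded reading, releasing only messages that match none of them. The phrase table below is invented for illustration and is not a real ATC phraseology standard; the paper's proposed interface would instead need full phonetic, syntactic, semantic and pragmatic analysis in real time.

```python
# Hypothetical table of ambiguous phrases and their competing readings.
AMBIGUOUS = {
    "take off power": ["apply takeoff thrust", "remove thrust"],
    "go ahead": ["proceed with your message", "proceed with the manoeuvre"],
}

def screen(message):
    """Return the competing readings of any ambiguous phrase, or None if clear.

    A None result means the message may be transmitted; a non-None result
    means it must be "authenticated" (reformulated) before transmission.
    """
    text = message.lower()
    for phrase, readings in AMBIGUOUS.items():
        if phrase in text:
            return readings
    return None
```

Even this toy filter shows the operational shape of the proposal: the check sits between utterance and transmission, and its output is a warning plus the candidate readings, delivered fast enough for correction.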
Early attempts to process natural language by mechanical means date back to the 1930s, and the first machine translation applications are known from the 1950s. In view of this long history, it is rather strange that even in the mid-nineties the technology is used quite rarely in the daily work of translators. Based on eight years' experience as a user of machine translation (starting with LOGOS and changing to METAL), I will discuss the reasons why translators are still reluctant to use machine translation for their everyday work.
In this paper we take a broad look at the likely implications of future developments in machine translation. To do this effectively, we first consider what constitutes machine translation in all its various forms. We then examine a number of scenarios which differ in the extent to which machine translation is successful.