Workshop on patent translation
The approach presented here enables Japanese users with no knowledge of English or legal English to generate patent claims in English from a Japanese-only interface. It exploits the highly determined structure of patent claims and merges Natural Language Generation (NLG) and Machine Translation (MT) techniques and resources as realized in the AutoPat and PC-Transfer applications. Due to its tuned MT engine, the approach can be seen as a human-aided machine translation (HAMT) system circumventing major obstacles in full-scale Japanese-English MT. The approach is fully implemented on a large scale and will be commercially released in autumn 2005.
In this paper, we present a methodology for the development of interactive domain-tuned patent tools for generating patent claims in English from non-English interfaces. The methodology is based on a merger of AutoPat, an interactive English-to-English patent claim generator, with any external MT engine that might be appropriate for a given language. The translation procedure is reduced to translating words and phrases rather than a complete claim sentence. The approach has been successfully used in the J-E patent system, a generator of English patent claims from a Japanese-only interface, and in Dan-Pat, a similar tool for the Danish-English language pair. The two systems use different MT engines but share a similar overall architecture. The methodology is portable to other languages and MT engines.
It is well known that sentences in Japanese patents have long and complicated structures, particularly in their necessary conditions and details. Here, patent sentences are analyzed and classified by patterns of modification relations. Morphemes were first extracted using the widely used morphological analyzer ChaSen, and the modification (dependency) relations were then extracted using CaboCha. The long, complicated structures caused many modification errors, which required manual correction. During this correction, the modification-structure patterns were classified over about 200 sentences. This clarified the characteristics of Japanese patent sentences and is useful for machine translation of patent sentences.
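The classification step described above could be sketched as follows, assuming the dependency output of a parser such as CaboCha is already available as (modifier index, head index) pairs; the tool invocation itself, and the authors' actual pattern taxonomy, are omitted, so the distance-based pattern key here is only an illustrative proxy.

```python
from collections import Counter

def classify_modification_patterns(sentences):
    """Group sentences by the shape of their dependency (modification) links.

    Each sentence is a list of (modifier, head) index pairs, as a dependency
    parser like CaboCha would produce. The pattern key records the distance of
    each link, a rough proxy for the long-distance modifications typical of
    Japanese patent sentences.
    """
    patterns = Counter()
    for links in sentences:
        distances = tuple(head - mod for mod, head in links)
        patterns[distances] += 1
    return patterns

# Toy example: two sentences with adjacent modifications, and one where three
# modifiers all attach to a distant head.
sents = [
    [(0, 1), (1, 2)],
    [(0, 1), (1, 2)],
    [(0, 3), (1, 3), (2, 3)],
]
print(classify_modification_patterns(sents))
```

Sentences sharing a distance tuple fall into the same pattern class, so the counter directly reports which modification shapes dominate a sample.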
A multilingual sense code may chart "constant-sense connection paths" across languages. A writer, not versed in any target language, may nonetheless proofread the sense for translation and edit it, to ensure that his meaning is conveyed to other languages as he wishes. A translation-ready format may thus be produced, to serve as a printing-press plate for precise and automatic translation into any language, or into a plurality of languages. The translation-ready format may describe each word and the full document with a comprehensive code, which specifies the multilingual sense code and other relevant information about the word, in a standardized fashion, digitally, forming a unified, language-independent tagging system and a unified, language-independent lexicon.
Large-scale parallel corpora are extremely important for translation memory, example-based machine translation, and systems that support the creation of English sentences. Organized collection and establishment of large-scale corpora is currently ongoing; however, it is a difficult undertaking in terms of both copyright and economic efficiency. Investigating the general tendencies of large-scale corpora helps improve the economic efficiency of parallel corpus collection as well as of system construction. In this study, therefore, the relationship between the scale of a parallel corpus and its degree of correspondence is clarified, using a parallel corpus of patents.
The paper describes some ways to save on knowledge acquisition when developing MT systems for patents by reducing the size of the resources to be acquired and creating intelligent software for knowledge handling and fast access. The approach is illustrated by knowledge acquisition and maintenance in the APTrans system for translating patent claims. Domain-tuned resources are based on contrastive studies of multilingual patent documents and are handled by an electronic dictionary with a powerful, user-friendly environment for acquisition, editing, browsing, defaulting, and coherence proofing.
The domain dependence of translations of nouns in English-to-Japanese patent translation is examined using an automatic method for identifying major translations from a pair of language corpora in the same domain. The method calculates the ratio of the number of associated words of a target word that suggest each translation of the target word to the total number of associated words. This ratio indicates how major a translation is in a domain. Application of the method to a bilingual patent-abstract corpus indicates the necessity and effectiveness of dividing the patent domain into subdomains and adapting a bilingual dictionary to subdomains.
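The ratio described above could be sketched as follows; the function names, the example word, and the toy association data are all illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def major_translation_ratios(associated_words, suggested_translation):
    """Ratio of associated words suggesting each translation of a target word.

    associated_words: words that co-occur with the target word in the domain
    corpus. suggested_translation: maps an associated word to the translation
    of the target word it suggests (absent if it suggests none). The ratio
    indicates how 'major' each translation is in the domain.
    """
    votes = Counter(
        t for w in associated_words
        if (t := suggested_translation.get(w)) is not None
    )
    total = len(associated_words)
    return {t: n / total for t, n in votes.items()}

# Hypothetical example: the English word "cell" in a mixed patent sample,
# where some associated words point to 細胞 (biological cell) and others
# to 電池 (battery cell).
assoc = ["membrane", "culture", "battery", "protein"]
suggests = {"membrane": "細胞", "culture": "細胞",
            "battery": "電池", "protein": "細胞"}
print(major_translation_ratios(assoc, suggests))  # {'細胞': 0.75, '電池': 0.25}
```

A high ratio for one translation within a subdomain is what motivates splitting the patent domain into subdomains and adapting the bilingual dictionary to each.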
This paper describes a method for retrieving technical terms and finding their translation candidates from patent corpora. The method improves the reliability of the bilingual seed words used to measure similarity between a target word and its translation candidates. We conducted an experiment with PAJ (Patent Abstracts of Japan), a collection of bilingual patent abstracts written in Japanese and English. The experimental results show that our method achieves a precision of 53.5% and a recall of 75.4%.
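The core idea of seed-word-based candidate ranking could be sketched as cosine similarity over co-occurrence vectors indexed by bilingual seed pairs; the seed pairs, candidate names, and counts below are toy assumptions, and the abstract's reliability-improvement step is not modeled.

```python
import math

def cosine(u, v):
    # Cosine similarity of two equal-length count vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(target_vec, candidate_vecs):
    """Rank translation candidates by cosine similarity of co-occurrence
    vectors over a shared, ordered list of bilingual seed-word pairs."""
    scored = [(cosine(target_vec, v), c) for c, v in candidate_vecs.items()]
    return sorted(scored, reverse=True)

# Toy co-occurrence counts over three hypothetical seed pairs,
# e.g. (circuit/回路, signal/信号, voltage/電圧).
target = [5, 3, 0]                # English term's counts with the English seeds
cands = {"候補A": [4, 2, 0],      # Japanese candidates' counts with the
         "候補B": [0, 1, 6]}      # corresponding Japanese seeds
print(rank_candidates(target, cands))
```

Because both vectors are indexed by the same seed pairs, no direct translation of the target word is needed: candidates whose distributional profile over the seeds matches the target's rank highest.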
This paper addresses the workflow of terminology construction for a Korean-English patent MT system. The workflow consists of a stage for setting lexical goals and a semi-automatic terminology construction stage. As there is no comparable system, it is difficult to determine how many terms are needed; to estimate this number, we analyzed 45,000 patent documents. Given the limited time and budget, we resorted to semi-automatic methods to create a bilingual term dictionary in the electronics domain. We show that parenthesis information in Korean patent documents and a bilingual title corpus can be successfully used to build a bilingual term dictionary.
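The use of parenthesis information could be sketched as below, on the common pattern where a Korean patent text glosses a term with its English equivalent in parentheses, e.g. "반도체(semiconductor)"; the regular expression is an illustrative assumption, not the authors' actual extraction rule.

```python
import re

# Hypothetical pattern: a run of Hangul immediately followed by a
# parenthesized English gloss.
PAREN = re.compile(r"([가-힣]+)\s*\(\s*([A-Za-z][A-Za-z\s-]*?)\s*\)")

def extract_term_pairs(text):
    """Extract (Korean term, English gloss) candidate pairs from parentheses."""
    return [(ko, en.strip()) for ko, en in PAREN.findall(text)]

sample = "본 발명은 반도체(semiconductor) 소자의 트랜지스터(transistor)에 관한 것이다."
print(extract_term_pairs(sample))
```

Pairs harvested this way would still need filtering and frequency thresholds before entering a term dictionary, but the pattern yields aligned candidates at essentially no annotation cost.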