Jong-Hyeok Lee

Also published as: Jong-hyeok Lee

2024

Advancing Semi-Supervised Learning for Automatic Post-Editing: Data-Synthesis by Mask-Infilling with Erroneous Terms
Wonkee Lee | Seong-Hwan Heo | Jong-Hyeok Lee
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Semi-supervised learning that leverages synthetic data for training has been widely adopted for developing automatic post-editing (APE) models due to the lack of training data. With this aim, we focus on data-synthesis methods to create high-quality synthetic data. Given that APE takes as input a machine-translation result that might include errors, we present a data-synthesis method by which the resulting synthetic data mimic the translation errors found in actual data. We introduce a noising-based data-synthesis method by adapting the masked language model approach, generating a noisy text from a clean text by infilling masked tokens with erroneous tokens. Moreover, we propose selective corpus interleaving that combines two separate synthetic datasets by taking only the advantageous samples to enhance the quality of the synthetic data further. Experimental results show that using the synthetic data created by our approach results in significantly better APE performance than other synthetic data created by existing methods.

2023

Bring More Attention to Syntactic Symmetry for Automatic Postediting of High-Quality Machine Translations
Baikjin Jung | Myungji Lee | Jong-Hyeok Lee | Yunsu Kim
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Automatic postediting (APE) is an automated process to refine a given machine translation (MT). Recent findings show that existing APE systems are not good at handling high-quality MTs even for a language pair with abundant data resources, English–German: the better the given MT is, the harder it is to decide what parts to edit and how to fix these errors. One possible solution to this problem is to instill deeper knowledge about the target language into the model. Thus, we propose a linguistically motivated method of regularization that is expected to enhance APE models’ understanding of the target language: a loss function that encourages symmetric self-attention on the given MT. Our analysis of experimental results demonstrates that the proposed method helps improve the state-of-the-art architecture’s APE quality for high-quality MTs.

2021

Adaptation of Back-translation to Automatic Post-Editing for Synthetic Data Generation
WonKee Lee | Baikjin Jung | Jaehun Shin | Jong-Hyeok Lee
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Automatic Post-Editing (APE) aims to correct errors in the output of a given machine translation (MT) system. Although data-driven approaches have become prevalent also in the APE task as in many other NLP tasks, there has been a lack of qualified training data due to the high cost of manual construction. eSCAPE, a synthetic APE corpus, has been widely used to alleviate the data scarcity, but it might not address genuine APE corpora’s characteristic that the post-edited sentence should be a minimally edited revision of the given MT output. Therefore, we propose two new methods of synthesizing additional MT outputs by adapting back-translation to the APE task, obtaining robust enlargements of the existing synthetic APE training dataset. Experimental results on the WMT English-German APE benchmarks demonstrate that our enlarged datasets are effective in improving APE performance.

Quality Estimation Using Dual Encoders with Transfer Learning
Dam Heo | WonKee Lee | Baikjin Jung | Jong-Hyeok Lee
Proceedings of the Sixth Conference on Machine Translation

This paper describes POSTECH’s quality estimation systems submitted to Task 2 of the WMT 2021 quality estimation shared task: Word and Sentence-Level Post-editing Effort. We notice that it is possible to improve the stability of the latest quality estimation models, which have only one encoder based on the self-attention mechanism to process both inputs, a source sequence and its machine translation, simultaneously. Such models have neglected to take advantage of pre-trained monolingual representations, which are generally accepted as reliable representations for various natural language processing tasks. Therefore, our model uses two pre-trained monolingual encoders and then exchanges the information of the two encoded representations through two additional cross-attention networks. According to the official leaderboard, our systems outperform the baseline systems in terms of the Matthews correlation coefficient for machine translations’ word-level quality estimation and in terms of Pearson’s correlation coefficient for sentence-level quality estimation by 0.4126 and 0.5497, respectively.

Transformer-based Screenplay Summarization Using Augmented Learning Representation with Dialogue Information
Myungji Lee | Hongseok Kwon | Jaehun Shin | WonKee Lee | Baikjin Jung | Jong-Hyeok Lee
Proceedings of the Third Workshop on Narrative Understanding

Screenplay summarization is the task of extracting informative scenes from a screenplay. The screenplay contains turning point (TP) events that change the story direction and thus define the story structure decisively. Accordingly, this task can be defined as the TP identification task. We suggest using dialogue information, one attribute of screenplays, motivated by previous work that discovered that TPs have a relation with dialogues appearing in screenplays. To teach a model this characteristic, we add a dialogue feature to the input embedding. Moreover, in an attempt to improve the model architecture of previous studies, we replace LSTM with Transformer. We observed that the model can better identify TPs in a screenplay by using dialogue information and that a model adopting Transformer outperforms LSTM-based models.

Tag Assisted Neural Machine Translation of Film Subtitles
Aren Siekmeier | WonKee Lee | Hongseok Kwon | Jong-Hyeok Lee
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)

We implemented a neural machine translation system that uses automatic sequence tagging to improve the quality of translation. Instead of operating on unannotated sentence pairs, our system uses pre-trained tagging systems to add linguistic features to source and target sentences. Our proposed neural architecture learns a combined embedding of tokens and tags in the encoder, and simultaneous token and tag prediction in the decoder. Compared to a baseline with unannotated training, this architecture increased the BLEU score of German to English film subtitle translation outputs by 1.61 points using named entity tags; however, the BLEU score decreased by 0.38 points using part-of-speech tags. This demonstrates that certain token-level tag outputs from off-the-shelf tagging systems can improve the output of neural translation systems using our combined embedding and simultaneous decoding extensions.

2020

POSTECH Submission on Duolingo Shared Task
Junsu Park | Hongseok Kwon | Jong-Hyeok Lee
Proceedings of the Fourth Workshop on Neural Generation and Translation

In this paper, we propose a transfer learning based simultaneous translation model by extending BART. We pre-trained BART with Korean Wikipedia and a Korean news dataset, and fine-tuned with an additional web-crawled parallel corpus and the 2020 Duolingo official training dataset. In our experiments on the 2020 Duolingo test dataset, our submission achieves 0.312 in weighted macro F1 score, and ranks second among the submitted En-Ko systems.

POSTECH-ETRI’s Submission to the WMT2020 APE Shared Task: Automatic Post-Editing with Cross-lingual Language Model
Jihyung Lee | WonKee Lee | Jaehun Shin | Baikjin Jung | Young-Kil Kim | Jong-Hyeok Lee
Proceedings of the Fifth Conference on Machine Translation

This paper describes POSTECH-ETRI’s submission to WMT2020 for the shared task on automatic post-editing (APE) for 2 language pairs: English-German (En-De) and English-Chinese (En-Zh). We propose APE systems based on a cross-lingual language model, which jointly adopts translation language modeling (TLM) and masked language modeling (MLM) training objectives in the pre-training stage; the APE models then utilize jointly learned language representations between the source language and the target language. In addition, we created 19 million new synthetic triplets as additional training data for our final ensemble model. According to experimental results on the WMT2020 APE development data set, our models showed an improvement over the baseline by TER of -3.58 and a BLEU score of +5.3 for the En-De subtask; and TER of -5.29 and a BLEU score of +7.32 for the En-Zh subtask.

Noising Scheme for Data Augmentation in Automatic Post-Editing
WonKee Lee | Jaehun Shin | Baikjin Jung | Jihyung Lee | Jong-Hyeok Lee
Proceedings of the Fifth Conference on Machine Translation

This paper describes POSTECH’s submission to WMT20 for the shared task on Automatic Post-Editing (APE). Our focus is on increasing the quantity of available APE data to overcome the shortage of human-crafted training data. In our experiment, we implemented a noising module that simulates four types of post-editing errors, and we introduced this module into a Transformer-based multi-source APE model. Our noising module implants errors into texts on the target side of parallel corpora during the training phase to make synthetic MT outputs, increasing the entire number of training samples. We also generated additional training data using the parallel corpora and NMT model that were released for the Quality Estimation task, and we used these data to train our APE model. Experimental results on the WMT20 English-German APE data set show improvements over the baseline in terms of both the TER and BLEU scores: our primary submission achieved an improvement of -3.15 TER and +4.01 BLEU, and our contrastive submission achieved an improvement of -3.34 TER and +4.30 BLEU.

2019

Decay-Function-Free Time-Aware Attention to Context and Speaker Indicator for Spoken Language Understanding
Jonggu Kim | Jong-Hyeok Lee
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

To capture salient contextual information for spoken language understanding (SLU) of a dialogue, we propose time-aware models that automatically learn the latent time-decay function of the history without a manual time-decay function. We also propose a method to identify and label the current speaker to improve the SLU accuracy. In experiments on the benchmark dataset used in Dialog State Tracking Challenge 4, the proposed models achieved significantly higher F1 scores than the state-of-the-art contextual models. Finally, we analyze the effectiveness of the introduced models in detail. The analysis demonstrates that each of the proposed methods was individually effective in improving SLU accuracy.

Transformer-based Automatic Post-Editing Model with Joint Encoder and Multi-source Attention of Decoder
WonKee Lee | Jaehun Shin | Jong-Hyeok Lee
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

This paper describes POSTECH’s submission to the WMT 2019 shared task on Automatic Post-Editing (APE). In this paper, we propose a new multi-source APE model by extending Transformer. The main contributions of our study are that we 1) reconstruct the encoder to generate a joint representation of translation (mt) and its src context, in addition to the conventional src encoding and 2) suggest two types of multi-source attention layers to compute attention between two outputs of the encoder and the decoder state in the decoder. Furthermore, we train our model by applying various teacher-forcing ratios to alleviate exposure bias. Finally, we adopt the ensemble technique across variations of our model. Experiments on the WMT19 English-German APE data set show improvements in terms of both TER and BLEU scores over the baseline. Our primary submission achieves -0.73 in TER and +1.49 in BLEU compared to the baseline.

2018

Multi-encoder Transformer Network for Automatic Post-Editing
Jaehun Shin | Jong-Hyeok Lee
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes POSTECH’s submission to the WMT 2018 shared task on Automatic Post-Editing (APE). We propose a new neural end-to-end post-editing model based on the transformer network. We modified the encoder-decoder attention to reflect the relation between the machine translation output, the source and the post-edited translation in the APE problem. Experiments on the WMT17 English-German APE data set show an improvement in both TER and BLEU score over the best result of the WMT17 APE shared task. Our primary submission achieves -4.52 TER and +6.81 BLEU score on the PBSMT task and -0.13 TER and +0.40 BLEU score on the NMT task compared to the baseline.

We participated in the IWSLT 2013 Evaluation Campaign for the MT track for two official directions: German↔English. Our system consisted of a reordering module and a statistical machine translation (SMT) module under a pre-ordering SMT framework. We trained the reordering module using three scalable methods in order to utilize as many training instances as possible. The translation quality of our primary submissions was comparable to that of a hierarchical phrase-based SMT, which usually requires a longer time to decode.

2012

Forest-to-string translation using binarized dependency forest for IWSLT 2012 OLYMPICS task
Hwidong Na | Jong-Hyeok Lee
Proceedings of the 9th International Workshop on Spoken Language Translation: Evaluation Campaign

We participated in the OLYMPICS task in IWSLT 2012 and submitted two formal runs using a forest-to-string translation system. Our primary run achieved better translation quality than our contrastive run, but worse than a phrase-based and a hierarchical system using Moses.

2011

Multi-Word Unit Dependency Forest-based Translation Rule Extraction
Hwidong Na | Jong-Hyeok Lee
Proceedings of Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation

Beyond Chart Parsing: An Analytic Comparison of Dependency Chart Parsing Algorithms
Meixun Jin | Hwidong Na | Jong-Hyeok Lee
Proceedings of the 12th International Conference on Parsing Technologies

2010

Transferring Syntactic Relations of Subject-Verb-Object Pattern in Chinese-to-Korean SMT
Jin-Ji Li | Jungi Kim | Jong-Hyeok Lee
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

Since most Korean postpositions signal grammatical functions such as syntactic relations, generation of incorrect Korean postpositions results in producing ungrammatical outputs in machine translations targeting Korean. Chinese and Korean belong to morphosyntactically divergent language pairs, and usually Korean postpositions do not have their counterparts in Chinese. In this paper, we propose a preprocessing method for a statistical MT system that generates more adequate Korean postpositions. We transfer syntactic relations of subject-verb-object patterns in Chinese sentences and enrich them with transferred syntactic relations in order to reduce the morphosyntactic differences. The effectiveness of our proposed method is measured with lexical units of various granularities. Human evaluation also suggests improvements over previous methods, which are consistent with the result of the automatic evaluation.

Chinese Syntactic Reordering through Contrastive Analysis of Predicate-predicate Patterns in Chinese-to-Korean SMT
Jin-Ji Li | Jungi Kim | Jong-Hyeok Lee
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

We propose a Chinese dependency tree reordering method for Chinese-to-Korean SMT systems through analyzing systematic differences between the Chinese and Korean languages. Translating predicate-predicate patterns in Chinese into Korean raises various issues such as long-distance reordering. This paper concentrates on syntactic reordering of predicate-predicate patterns in Chinese dependency trees through contrastively analyzing construction types in Chinese and their corresponding translations in Korean. We explore useful linguistic knowledge that assists effective syntactic reordering of Chinese dependency trees; we design two experiments with different kinds of linguistic knowledge combined with the phrase and hierarchical phrase-based SMT systems, and assess the effectiveness of our proposed methods. The experiments achieved significant improvements by resolving the long-distance reordering problem.

A Synchronous Context Free Grammar using Dependency Sequence for Syntax-based Statistical Machine Translation
Hwidong Na | Jin-Ji Li | Yeha Lee | Jong-hyeok Lee
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Student Research Workshop

We introduce a novel translation rule that captures discontinuous, partial constituent, and non-projective phrases from the source language. Using the traversal order sequences of the dependency tree, our proposed method 1) extracts the synchronous rules in linear time and 2) combines them efficiently using the CYK chart parsing algorithm. We analytically show the effectiveness of this translation rule in translating sentences with relatively free word order, and empirically investigate the coverage of our proposed method.

Evaluating Multilanguage-Comparability of Subjectivity Analysis Systems
Jungi Kim | Jin-Ji Li | Jong-Hyeok Lee
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

The POSTECH’s statistical machine translation system for the IWSLT 2010
Hwidong Na | Jong-Hyeok Lee
Proceedings of the 7th International Workshop on Spoken Language Translation: Evaluation Campaign

2009

Discovering the Discriminative Views: Measuring Term Weights for Sentiment Analysis
Jungi Kim | Jin-Ji Li | Jong-Hyeok Lee
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

Chinese Syntactic Reordering for Adequate Generation of Korean Verbal Phrases in Chinese-to-Korean SMT
Jin-Ji Li | Jungi Kim | Dong-Il Kim | Jong-Hyeok Lee
Proceedings of the Fourth Workshop on Statistical Machine Translation

Improving Fluency by Reordering Target Constituents using MST Parser in English-to-Japanese Phrase-based SMT
Hwidong Na | Jin-Ji Li | Jungi Kim | Jong-Hyeok Lee
Proceedings of Machine Translation Summit XII: Posters

Method of Extracting Is-A and Part-Of Relations Using Pattern Pairs in Mass Corpus
Se-Jong Kim | Yong-Hun Lee | Jong-Hyeok Lee
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1

2008

Annotation Guidelines for Chinese-Korean Word Alignment
Jin-Ji Li | Dong-Il Kim | Jong-Hyeok Lee
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

For a language pair such as Chinese and Korean, which belong to entirely different language families in terms of typology and genealogy, finding the correspondences in word alignment is quite obscure. We present annotation guidelines for Chinese-Korean word alignment through contrastive analysis of morpho-syntactic encodings. We discuss the differences in verbal systems that cause most of the linking obscurities in the annotation process. Systematic comparison of verbal systems is conducted by analyzing morpho-syntactic encodings. The viewpoint of grammatical category allows us to define consistent and systematic instructions for linguistically distant languages such as Chinese and Korean. The scope of our guidelines is limited to the alignment between Chinese and Korean, but the instruction methods exemplified in this paper are also applicable in developing systematic and comprehensible alignment guidelines for other languages having such different linguistic phenomena.

Search Result Clustering Using Label Language Model
Yeha Lee | Seung-Hoon Na | Jong-Hyeok Lee
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

Automatic Extraction of English-Chinese Transliteration Pairs using Dynamic Window and Tokenizer
Chengguo Jin | Seung-Hoon Na | Dong-Il Kim | Jong-Hyeok Lee
Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing

2005

Chunking Using Conditional Random Fields in Korean Texts
Yong-Hun Lee | Mi-Young Kim | Jong-Hyeok Lee
Second International Joint Conference on Natural Language Processing: Full Papers

Two-Phase Shift-Reduce Deterministic Dependency Parser of Chinese
Meixun Jin | Mi-Young Kim | Jong-Hyeok Lee
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

2004

Segmentation of Chinese Long Sentences Using Commas
Meixun Jin | Mi-Young Kim | Dongil Kim | Jong-Hyeok Lee
Proceedings of the Third SIGHAN Workshop on Chinese Language Processing

Term Extraction from Korean Corpora via Japanese
Atsushi Fujii | Tetsuya Ishikawa | Jong-Hyeok Lee
Proceedings of CompuTerm 2004: 3rd International Workshop on Computational Terminology

2003

An empirical study for generating zero pronoun in Korean based on Cost-based centering model
Ji-Eun Roh | Jong-Hyeok Lee
Proceedings of the Australasian Language Technology Workshop 2003

S-clause segmentation for efficient syntactic analysis using decision trees
Mi-Young Kim | Jong-Hyeok Lee
Proceedings of the Australasian Language Technology Workshop 2003

Resolving Sense Ambiguity of Korean Nouns Based on Concept Co-occurrence Information
You-Jin Chung | Jong-Hyeok Lee
Proceedings of the Australasian Language Technology Workshop 2003

Conceptual Schema Approach to Natural Language Database Access
In-Su Kang | Seung-Hoon Na | Jong-Hyeok Lee
Proceedings of the Australasian Language Technology Workshop 2003

2002

Syllable-Pattern-Based Unknown-Morpheme Segmentation and Estimation for Hybrid Part-of-Speech Tagging of Korean
Gary Geunbae Lee | Jeongwon Cha | Jong-Hyeok Lee
Computational Linguistics, Volume 28, Number 1, March 2002

Word Sense Disambiguation in a Korean-to-Japanese MT System Using Neural Networks
You-Jin Chung | Sin-Jae Kang | Kyong-Hi Moon | Jong-Hyeok Lee
COLING-02: Machine Translation in Asia

A Knowledge Based Approach to Identification of Serial Verb Construction in Chinese-to-Korean Machine Translation System
Dong-il Kim | Zheng Cui | Jinji Li | Jong-Hyeok Lee
COLING-02: The First SIGHAN Workshop on Chinese Language Processing

2001

Semi-Automatic Practical Ontology Construction by Using a Thesaurus, Computational Dictionaries, and Large Corpora
Sin-Jae Kang | Jong-Hyeok Lee
Proceedings of the ACL 2001 Workshop on Human Language Technology and Knowledge Management

Ontology-based word sense disambiguation using semi-automatically constructed ontology
Sin-Jae Kang | Jong-Hyeok Lee
Proceedings of Machine Translation Summit VIII

This paper describes a method for disambiguating word senses by using a semi-automatically constructed ontology. The ontology stores rich semantic constraints among 1,110 concepts, and enables a natural language processing system to resolve semantic ambiguities by making inferences with the concept network of the ontology. In order to acquire a reasonably practical ontology in limited time and with less manpower, we extend the existing Kadokawa thesaurus by inserting additional semantic relations into its hierarchy, which are classified as case relations and other semantic relations. The former can be obtained by converting valency information and case frames from previously built electronic dictionaries used in machine translation. The latter can be acquired from concept co-occurrence information, which is extracted automatically from large corpora. In our practical machine translation system, our word sense disambiguation method achieved a 9.2% improvement over methods which do not use an ontology for Korean translation.

2000

Representation and Recognition Method for Multi-Word Translation Units in Korean-to-Japanese MT System
Kyonghi Moon | Jong-Hyeok Lee
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

1999

The use of abstracted knowledge from an automatically sense-tagged corpus for lexical transfer ambiguity resolution
Hui-Feng Li | Namwon Heo | Kyounghi Moon | Jong-Hyeok Lee
Proceedings of Machine Translation Summit VII

This paper proposes a method for lexical transfer ambiguity resolution using corpus and conceptual information. Previous research has restricted the use of linguistic knowledge to the lexical level. Since the extracted knowledge is stored in words themselves, these methods require a large amount of space and have a low recall rate. In contrast, we resolve word sense ambiguity by using concept co-occurrence information extracted from an automatically sense-tagged corpus. In one experiment, it achieved, on average, a precision of 82.4% for nominal words, and 83% for verbal words. Considering that the test corpus is completely irrelevant to the learning corpus, this is a promising result.