Chao-Hong Liu


2021

Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)
John Ortega | Atul Kr. Ojha | Katharina Kann | Chao-Hong Liu
Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)

Findings of the LoResMT 2021 Shared Task on COVID and Sign Language for Low-resource Languages
Atul Kr. Ojha | Chao-Hong Liu | Katharina Kann | John Ortega | Sheetal Shatam | Theodorus Fransen
Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)

We present the findings of the LoResMT 2021 shared task, which focuses on machine translation (MT) of COVID-19 data for both low-resource spoken and sign languages. The task was organized as part of the fourth workshop on technologies for machine translation of low resource languages (LoResMT). Parallel corpora are presented and made publicly available for the following directions: English↔Irish, English↔Marathi, and Taiwanese Sign language↔Traditional Chinese. Training data consists of 8112, 20933 and 128608 segments, respectively. There are additional monolingual data sets for Marathi and English that consist of 21901 segments. The results presented here are based on entries from a total of eight teams. Three teams submitted systems for English↔Irish, while five teams submitted systems for English↔Marathi. Unfortunately, there were no system submissions for the Taiwanese Sign language↔Traditional Chinese task. Maximum system performance, computed using BLEU, was 36.0 for English–Irish, 34.6 for Irish–English, 24.2 for English–Marathi, and 31.3 for Marathi–English.

2020

Multiple Segmentations of Thai Sentences for Neural Machine Translation
Alberto Poncelas | Wichaya Pidchamook | Chao-Hong Liu | James Hadley | Andy Way
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Thai is a low-resource language, so it is often the case that data is not available in sufficient quantities to train a Neural Machine Translation (NMT) model that performs to a high level of quality. In addition, the Thai script does not use white spaces to delimit the boundaries between words, which adds complexity when building sequence-to-sequence models. In this work, we explore how to augment a set of English–Thai parallel data by replicating sentence pairs with different word segmentation methods on Thai, as training data for NMT models. Using different merge operations of Byte Pair Encoding, different segmentations of Thai sentences can be obtained. The experiments show that combining these datasets improves performance for NMT models trained with a dataset that has been split using a supervised splitting tool.

Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages
Alina Karakanta | Atul Kr. Ojha | Chao-Hong Liu | Jade Abbott | John Ortega | Jonathan Washington | Nathaniel Oco | Surafel Melaku Lakew | Tommi A Pirinen | Valentin Malykh | Varvara Logacheva | Xiaobing Zhao
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

Findings of the LoResMT 2020 Shared Task on Zero-Shot for Low-Resource languages
Atul Kr. Ojha | Valentin Malykh | Alina Karakanta | Chao-Hong Liu
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

This paper presents the findings of the LoResMT 2020 Shared Task on zero-shot translation for low resource languages. This task was organised as part of the 3rd Workshop on Technologies for MT of Low Resource Languages (LoResMT) at AACL-IJCNLP 2020. The focus was on the zero-shot approach as a notable development in Neural Machine Translation to build MT systems for language pairs where parallel corpora are small or even non-existent. The shared task experience suggests that back-translation and domain adaptation methods result in better accuracy for small-size datasets. We further noted that, although translation between similar languages is no cakewalk, linguistically distinct languages require more data to give better results.

2019

Pivot Machine Translation in INTERACT Project
Chao-Hong Liu | Andy Way | Catarina Silva | André Martins
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages
Alina Karakanta | Atul Kr. Ojha | Chao-Hong Liu | Jonathan Washington | Nathaniel Oco | Surafel Melaku Lakew | Valentin Malykh | Xiaobing Zhao
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages

2018

Chinese-Portuguese Machine Translation: A Study on Building Parallel Corpora from Comparable Texts
Siyou Liu | Longyue Wang | Chao-Hong Liu
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

The RGNLP Machine Translation Systems for WAT 2018
Atul Kr. Ojha | Koel Dutta Chowdhury | Chao-Hong Liu | Karan Saxena
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation

Proceedings of the AMTA 2018 Workshop on Technologies for MT of Low Resource Languages (LoResMT 2018)
Chao-Hong Liu
Proceedings of the AMTA 2018 Workshop on Technologies for MT of Low Resource Languages (LoResMT 2018)

Extracting In-domain Training Corpora for Neural Machine Translation Using Data Selection Methods
Catarina Cruz Silva | Chao-Hong Liu | Alberto Poncelas | Andy Way
Proceedings of the Third Conference on Machine Translation: Research Papers

Data selection is a process used in selecting a subset of parallel data for the training of machine translation (MT) systems, so that 1) resources for training might be reduced, 2) trained models could perform better than those trained with the whole corpus, and/or 3) trained models are more tailored to specific domains. It has been shown that for statistical MT (SMT), the use of data selection helps improve the MT performance significantly. In this study, we reviewed three data selection approaches for MT, namely Term Frequency–Inverse Document Frequency, Cross-Entropy Difference and Feature Decay Algorithm, and conducted experiments on Neural Machine Translation (NMT) with the data selected using the three approaches. The results showed that for NMT systems, using data selection also improved the performance, though the gain was not as large as for SMT systems.

2017

Ethical Considerations in NLP Shared Tasks
Carla Parra Escartín | Wessel Reijers | Teresa Lynn | Joss Moorkens | Andy Way | Chao-Hong Liu
Proceedings of the First ACL Workshop on Ethics in Natural Language Processing

Shared tasks are increasingly common in our field, and new challenges are suggested at almost every conference and workshop. However, as this has become an established way of pushing research forward, it is important to discuss how we researchers organise and participate in shared tasks, and make that information available to the community to allow further research improvements. In this paper, we present a number of ethical issues along with other areas of concern that are related to the competitive nature of shared tasks. As such issues could potentially impact on research ethics in the Natural Language Processing community, we also propose the development of a framework for the organisation of and participation in shared tasks that can help mitigate these issues.

Proceedings of the First Workshop on Curation and Applications of Parallel and Comparable Corpora
Haithem Afli | Chao-Hong Liu
Proceedings of the First Workshop on Curation and Applications of Parallel and Comparable Corpora

Proceedings of the IJCNLP 2017, Shared Tasks
Chao-Hong Liu | Preslav Nakov | Nianwen Xue
Proceedings of the IJCNLP 2017, Shared Tasks

IJCNLP-2017 Task 4: Customer Feedback Analysis
Chao-Hong Liu | Yasufumi Moriya | Alberto Poncelas | Declan Groves
Proceedings of the IJCNLP 2017, Shared Tasks

This document introduces the IJCNLP 2017 Shared Task on Customer Feedback Analysis. For this shared task we prepared corpora of customer feedback in four languages, i.e. English, French, Spanish and Japanese. They were annotated with a common meanings categorization, which was improved from an ADAPT-Microsoft pivot study on customer feedback. Twenty teams participated in the shared task and twelve of them submitted prediction results. The results show that the performance of predicting the meanings of customer feedback is reasonably good in all four languages. Nine system description papers are archived in the shared tasks proceedings.

2016

The ADAPT Bilingual Document Alignment system at WMT16
Pintu Lohar | Haithem Afli | Chao-Hong Liu | Andy Way
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

2013

Candidate Scoring Using Web-Based Measure for Chinese Spelling Error Correction
Liang-Chih Yu | Chao-Hong Liu | Chung-Hsien Wu
Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing