Domain-independent Punctuation and Segmentation Insertion

Eunah Cho, Jan Niehues, Alex Waibel


Abstract
Punctuation and segmentation are crucial in spoken language translation, as they have a strong impact on translation performance. However, the impact of rare or unknown words on the performance of punctuation and segmentation insertion has not been thoroughly studied. In this work, we simulate various degrees of domain match in the test scenario and investigate their impact on the punctuation insertion task. We explore three schemes for generalizing rare words using part-of-speech (POS) tokens. Experiments show that generalizing rare and unknown words greatly improves punctuation insertion performance, with gains of up to 8.8 F-score points in the out-of-domain test scenario. We show that this improvement in punctuation quality has a positive impact on subsequent machine translation (MT) performance, improving it by 2 BLEU points.
Anthology ID:
2017.iwslt-1.11
Volume:
Proceedings of the 14th International Conference on Spoken Language Translation
Month:
December 14-15
Year:
2017
Address:
Tokyo, Japan
Editors:
Sakriani Sakti, Masao Utiyama
Venue:
IWSLT
SIG:
SIGSLT
Publisher:
International Workshop on Spoken Language Translation
Pages:
74–81
URL:
https://aclanthology.org/2017.iwslt-1.11
Cite (ACL):
Eunah Cho, Jan Niehues, and Alex Waibel. 2017. Domain-independent Punctuation and Segmentation Insertion. In Proceedings of the 14th International Conference on Spoken Language Translation, pages 74–81, Tokyo, Japan. International Workshop on Spoken Language Translation.
Cite (Informal):
Domain-independent Punctuation and Segmentation Insertion (Cho et al., IWSLT 2017)
PDF:
https://aclanthology.org/2017.iwslt-1.11.pdf