Si Shen


2024

Overview of EvaHan2024: The First International Evaluation on Ancient Chinese Sentence Segmentation and Punctuation
Bin Li | Bolin Chang | Zhixing Xu | Minxuan Feng | Chao Xu | Weiguang Qu | Si Shen | Dongbo Wang
Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024

Ancient Chinese texts have no sentence boundaries or punctuation. Adding modern Chinese punctuation to these texts requires expertise, time, and effort. Automatic sentence segmentation and punctuation is considered a fundamental task in Ancient Chinese processing, but there has been no shared task to evaluate the performance of different systems. This paper presents the results of the first ancient Chinese sentence segmentation and punctuation bakeoff, held at the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) 2024. The contest uses metrics for detailed evaluation of 4 genres of unpublished texts with 11 punctuation types. Six teams submitted 32 runs. In the closed modality, where participants may use only the training data, the highest F1 scores obtained were 88.47% for sentence segmentation and 75.29% for sentence punctuation. Performance on the unseen data is about 10 percentage points lower than on published common texts, which means there is still room for improvement. The large language models outperform the traditional models, but LLMs alter around 1-2% of the original characters due to over-generation, so post-processing is needed to keep the text consistent.

2023

EvaHan2023: Overview of the First International Ancient Chinese Translation Bakeoff
Dongbo Wang | Litao Lin | Zhixiao Zhao | Wenhao Ye | Kai Meng | Wenlong Sun | Lianzhen Zhao | Xue Zhao | Si Shen | Wei Zhang | Bin Li
Proceedings of ALT2023: Ancient Language Translation Workshop

This paper presents the results of the First International Ancient Chinese Translation Bakeoff (EvaHan), a shared task of the Ancient Language Translation Workshop (ALT2023) and a co-located event of the 19th Edition of the Machine Translation Summit 2023 (MTS 2023). We describe the motivation for holding an international shared contest, as well as the datasets and tracks. The contest consists of two modalities, closed and open. In the closed modality, where participants may use only the training data, the participating teams achieved the highest BLEU scores of 27.3315 and 1.1102 in the tasks of translating Ancient Chinese to Modern Chinese and Ancient Chinese to English, respectively. In the open modality, where contestants may use any available data and models, the participating teams achieved the highest BLEU scores of 29.6832 and 6.5493 in the same two tasks, respectively.

2022

Increasing Visual Awareness in Multimodal Neural Machine Translation from an Information Theoretic Perspective
Baijun Ji | Tong Zhang | Yicheng Zou | Bojie Hu | Si Shen
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Multimodal machine translation (MMT) aims to improve translation quality by equipping the source sentence with its corresponding image. Despite promising performance, MMT models still suffer from input degradation: they focus more on textual information, while visual information is generally overlooked. In this paper, we endeavor to improve MMT performance by increasing visual awareness from an information-theoretic perspective. In detail, we decompose the informative visual signals into two parts: source-specific information and target-specific information. We use mutual information to quantify them and propose two objective-optimization methods to better leverage visual signals. Experiments on two datasets demonstrate that our approach effectively enhances the visual awareness of MMT models and achieves superior results against strong baselines.