Wei Bao
2025
Tibetan-Chinese Neural Machine Translation Based on Diversity Data Recombination Augmentation
Jiayi Xue | Jinming Chen | Bo Chen | Wei Bao | Xiaobing Zhao
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)
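Although neural machine translation for high-resource languages has made significant progress, low-resource languages face a more severe shortage of parallel data. To address this, we propose a diversity data recombination augmentation method (DiRec) for Tibetan-Chinese neural machine translation. The method exploits the bidirectional language ability of large language models to apply three kinds of recombination to existing Tibetan-Chinese parallel data: constituent recombination, sentence-pattern recombination, and style recombination. Two rounds of automatic quality filtering then yield the diversity-augmented data. In Tibetan-Chinese machine translation experiments, the DiRec-based model improves on the baseline by 4.83 percentage points on the generalization metric, by 0.55 BLEU, and by 0.20 chrF++. Finally, we analyze how the different recombination methods affect translation model performance.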
2020
Will_Go at SemEval-2020 Task 3: An Accurate Model for Predicting the (Graded) Effect of Context in Word Similarity Based on BERT
Wei Bao | Hongshu Che | Jiandong Zhang
Proceedings of the Fourteenth Workshop on Semantic Evaluation
Natural Language Processing (NLP) has been widely used in semantic analysis in recent years. Our paper discusses a methodology for analyzing the effect that context has on human perception of similar words, which is the third task of SemEval 2020. We apply several methods for calculating the distance between two embedding vectors generated by Bidirectional Encoder Representations from Transformers (BERT). Our team, Will_Go, won first place in the Finnish track of subtask 1 and second place in the English track of subtask 1.
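The abstract above does not say which distance measures were used. As a minimal sketch, assuming cosine similarity as one plausible choice, the change in similarity of a word pair between two contexts could be computed as follows. The function names and the toy vectors are hypothetical; in practice the vectors would be contextual BERT embeddings of the two target words.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_change(w1_ctx1, w2_ctx1, w1_ctx2, w2_ctx2):
    """Similarity of the word pair in context 2 minus its similarity in context 1.

    Each argument is the (assumed) contextual embedding of one target word
    in one of the two contexts.
    """
    return cosine_similarity(w1_ctx2, w2_ctx2) - cosine_similarity(w1_ctx1, w2_ctx1)


# Toy 2-dimensional vectors standing in for real embeddings.
change = similarity_change([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0])
```

A positive value means the two words are closer in the second context than in the first.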
Will_go at SemEval-2020 Task 9: An Accurate Approach for Sentiment Analysis on Hindi-English Tweets Based on Bert and Pesudo Label Strategy
Wei Bao | Weilong Chen | Wei Bai | Yan Zhuang | Mingyuan Cheng | Xiangyu Ma
Proceedings of the Fourteenth Workshop on Semantic Evaluation
Code-mixed language is widely used on social media, especially in multilingual societies such as India. Detecting the emotions expressed in such text is of great significance for tracking social development and political trends. In this paper, we propose an ensemble of a pseudo-label-based BERT model and a TF-IDF-based SGDClassifier model to identify the sentiment of Hindi-English (Hi-En) code-mixed data. The ensemble combines the rich semantic information from the BERT model with the word-frequency information from the probabilistic n-gram model to predict the sentiment of a given code-mixed tweet. Our team achieved an average F1 score of 0.731 on the final leaderboard; our CodaLab username is will_go.
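The abstract does not specify how the two models' outputs are combined. As a minimal sketch, assuming a weighted average of per-class probabilities (soft voting), the ensemble step might look like this. The function name, the `bert_weight` parameter, and the example probabilities are hypothetical and not taken from the paper.

```python
def ensemble_predict(bert_probs, tfidf_probs, labels, bert_weight=0.5):
    """Combine two models' class probabilities by weighted averaging.

    bert_probs and tfidf_probs are per-class probabilities in the same
    order as labels. bert_weight is an assumed mixing weight in [0, 1].
    Returns the predicted label and the combined probabilities.
    """
    if not (len(bert_probs) == len(tfidf_probs) == len(labels)):
        raise ValueError("probability lists and labels must have the same length")
    if not 0.0 <= bert_weight <= 1.0:
        raise ValueError("bert_weight must be between 0 and 1")
    combined = [
        bert_weight * b + (1.0 - bert_weight) * t
        for b, t in zip(bert_probs, tfidf_probs)
    ]
    best_index = max(range(len(labels)), key=lambda i: combined[i])
    return labels[best_index], combined


# Illustrative probabilities for one tweet; not real model outputs.
labels = ["negative", "neutral", "positive"]
label, combined = ensemble_predict(
    [0.2, 0.3, 0.5], [0.5, 0.3, 0.2], labels, bert_weight=0.7
)
```

With a higher weight on the BERT probabilities, its preferred class wins here even though the TF-IDF model favors a different one.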