Bo Ma

2025

OpenForecast: A Large-Scale Open-Ended Event Forecasting Dataset
Zhen Wang | Xi Zhou | Yating Yang | Bo Ma | Lei Wang | Rui Dong | Azmat Anwar
Proceedings of the 31st International Conference on Computational Linguistics

Complex events generally exhibit unforeseen, multifaceted, and multi-step developments and cannot be handled well by existing closed-ended event forecasting methods, which are constrained by a limited answer space. To accelerate research on complex event forecasting, we introduce OpenForecast, a large-scale open-ended dataset with two features: (1) OpenForecast defines three open-ended event forecasting tasks, enabling unforeseen, multifaceted, and multi-step forecasting. (2) OpenForecast collects and annotates a large-scale dataset from Wikipedia and news, comprising 43,419 complex events spanning 1950 to 2024. Notably, this annotation is completed automatically, without any manual annotation cost. We also introduce an automatic LLM-based Retrieval-Augmented Evaluation method (LRAE) for complex events, enabling OpenForecast to evaluate large language models' ability to forecast complex events. Finally, we conduct comprehensive human evaluations to verify the quality and challenges of OpenForecast, and the consistency between the LRAE metric and human evaluation. OpenForecast and related code will be publicly released.
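
As a rough illustration of how an LLM-based retrieval-augmented evaluation loop could look, the sketch below retrieves evidence passages for the gold outcome with TF-IDF similarity and asks an LLM judge for a 0-1 score. The retrieval method, prompt wording, and the `llm` callable are assumptions for illustration, not the paper's LRAE implementation.

```python
# Hypothetical sketch of an LLM-based Retrieval-Augmented Evaluation (LRAE) loop.
# The TF-IDF retriever, prompt template, and `llm` callable are assumptions; the
# paper's exact prompts, corpus, and scoring rubric are not reproduced here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Return the k corpus passages most similar to the query (TF-IDF cosine)."""
    vec = TfidfVectorizer().fit(corpus + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(corpus))[0]
    return [corpus[i] for i in sims.argsort()[::-1][:k]]

def score_forecast(llm, forecast: str, gold: str, corpus: list[str]) -> float:
    """Ask an LLM judge to grade an open-ended forecast against the gold outcome,
    grounding the judgment in retrieved evidence passages."""
    evidence = "\n".join(retrieve(gold, corpus))
    prompt = (
        "Evidence:\n" + evidence +
        f"\n\nActual outcome: {gold}\nModel forecast: {forecast}\n"
        "On a 0-1 scale, how well does the forecast anticipate the outcome? "
        "Answer with a single number."
    )
    return float(llm(prompt))  # `llm` is any callable returning the judge's reply
```

Here `llm` would wrap whatever judge model is available; only the retrieval step runs as-is.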

Low-Resource Language Expansion and Translation Capacity Enhancement for LLM: A Study on the Uyghur
Kaiwen Lu | Yating Yang | Fengyi Yang | Rui Dong | Bo Ma | Aihetamujiang Aihemaiti | Abibilla Atawulla | Lei Wang | Xi Zhou
Proceedings of the 31st International Conference on Computational Linguistics

Although large language models have significantly advanced natural language generation, their potential in low-resource machine translation has not yet been fully explored, especially for languages that translation models have never been trained on. In this study, we provide a detailed demonstration of how to efficiently add a low-resource language to a large language model and significantly enhance its translation ability, using Uyghur as an example. The process involves four stages: collecting and pre-processing monolingual data, conducting continuous pre-training on the extensive monolingual data, fine-tuning with a smaller parallel corpus under translation supervision, and, on this basis, applying direct preference optimization based on translation self-evolution (DPOSE). Extensive experiments show that our strategy effectively expands the set of low-resource languages supported by large language models and significantly enhances the model's translation ability in Uyghur using limited parallel data. Our research provides detailed insights for expanding large language models to other low-resource languages.
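
The DPOSE stage, as described, turns the model's own sampled translations into preference data. Below is a minimal sketch of that idea, assuming a sentence-level BLEU scorer from sacrebleu and a user-supplied sampling function; both are illustrative assumptions, not the paper's code.

```python
# Hypothetical sketch of building DPO preference pairs from "translation
# self-evolution": sample several candidate translations from the model itself,
# score them against a reference, and keep the best/worst as chosen/rejected.
# The BLEU-based scorer and function names are assumptions for illustration.
from sacrebleu.metrics import BLEU

_bleu = BLEU(effective_order=True)

def make_preference_pair(sample_fn, source: str, reference: str, k: int = 8):
    """sample_fn(source) -> one candidate translation (e.g. nucleus sampling)."""
    candidates = [sample_fn(source) for _ in range(k)]
    scored = sorted(candidates,
                    key=lambda c: _bleu.sentence_score(c, [reference]).score)
    rejected, chosen = scored[0], scored[-1]
    return {"prompt": source, "chosen": chosen, "rejected": rejected}
```

Pairs in this format can be fed directly to a standard DPO trainer; the self-evolution aspect is that both sides of each pair come from the model being optimized.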

2022

ASCM: An Answer Space Clustered Prompting Method without Answer Engineering
Zhen Wang | Yating Yang | Zhou Xi | Bo Ma | Lei Wang | Rui Dong | Azmat Anwar
Findings of the Association for Computational Linguistics: ACL 2022

Prompt-based learning, which exploits knowledge from pre-trained language models by providing textual prompts and designing appropriate answer-category mappings, has achieved impressive success on few-shot text classification and natural language inference (NLI). Because of diverse linguistic expressions, many different answer tokens can express the same category. However, both manual answer design and automatic answer search constrain the answer space and therefore rarely achieve ideal performance. To address this issue, we propose an answer space clustered prompting model (ASCM), together with a synonym initialization method (SI), which automatically categorizes all answer tokens in a semantically clustered embedding space. We also propose a stable semi-supervised method named stair learning (SL) that progressively distills knowledge from stronger models to weaker ones. Extensive experiments demonstrate that our ASCM+SL significantly outperforms existing state-of-the-art techniques in few-shot settings.
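
A minimal sketch of the answer-space clustering idea with synonym initialization follows, assuming precomputed answer-token embeddings and scikit-learn's KMeans; both the clustering algorithm and all names are illustrative choices, not the paper's implementation.

```python
# Hypothetical sketch: embed every answer token, cluster the embeddings, and map
# each cluster to a category by seeding its centroid with a few synonyms (the
# "synonym initialization"). KMeans and variable names are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def cluster_answer_space(token_embeddings: np.ndarray,
                         tokens: list[str],
                         seed_synonyms: dict[str, list[str]]):
    """token_embeddings: (n_tokens, dim) array; seed_synonyms: category -> tokens."""
    idx = {t: i for i, t in enumerate(tokens)}
    # Initialize one centroid per category from the mean of its seed synonyms.
    init = np.stack([
        token_embeddings[[idx[t] for t in syns]].mean(axis=0)
        for syns in seed_synonyms.values()
    ])
    km = KMeans(n_clusters=len(seed_synonyms), init=init, n_init=1)
    km.fit(token_embeddings)
    # Cluster j inherits the category whose synonyms seeded centroid j.
    cats = list(seed_synonyms)
    return {tokens[i]: cats[km.labels_[i]] for i in range(len(tokens))}
```

With explicit centroid initialization, cluster indices stay aligned with the seeding categories, so no separate cluster-to-label assignment step is needed.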

2021

基于时间注意力胶囊网络的维吾尔语情感分类模型(Uyghur Sentiment Classification Model Based on Temporal Attention Capsule Networks)
Hantian Luo (罗涵天) | Yating Yang (杨雅婷) | Rui Dong (董瑞) | Bo Ma (马博)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

Uyghur is a low-resource language, and how to improve the performance of Uyghur sentiment classification models under limited resources remains an open problem. To address the poor classification performance of existing Uyghur sentiment analysis, which stems from insufficient generalization ability, this paper proposes a Uyghur sentiment classification model based on a temporal convolutional attention capsule network. We conduct experiments on a Uyghur sentiment classification dataset and evaluate the model on multiple metrics (accuracy, precision, recall, and F1). The results show that, compared with traditional deep learning models, the proposed model effectively improves all metrics for Uyghur sentiment classification.
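
For readers unfamiliar with capsule networks, the sketch below shows dynamic routing between capsules, the core building block of this family of models; the temporal-attention component and all Uyghur-specific details are omitted, and the shapes and names are illustrative assumptions rather than the paper's architecture.

```python
# Minimal PyTorch sketch of dynamic routing between capsules. Hyperparameters,
# the temporal-attention layer, and the classification head are omitted.
import torch
import torch.nn.functional as F

def squash(s: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Capsule nonlinearity: shrink vector length into (0, 1), keep direction."""
    norm2 = (s * s).sum(dim=dim, keepdim=True)
    return (norm2 / (1.0 + norm2)) * s / (norm2.sqrt() + 1e-8)

def dynamic_routing(u_hat: torch.Tensor, iters: int = 3) -> torch.Tensor:
    """u_hat: (batch, n_in, n_out, dim) prediction vectors from lower capsules.
    Returns (batch, n_out, dim) output capsules."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)   # routing logits
    for _ in range(iters):
        c = F.softmax(b, dim=2)                              # coupling coefficients
        v = squash((c.unsqueeze(-1) * u_hat).sum(dim=1))     # (batch, n_out, dim)
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)         # agreement update
    return v
```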