Yating Yang

Also published as: 雅婷杨

2025

Complex events generally exhibit unforeseen, multifaceted, and multi-step developments, and cannot be well handled by existing closed-ended event forecasting methods, which are constrained by a limited answer space. In order to accelerate the research on complex event forecasting, we introduce OpenForecast, a large-scale open-ended dataset with two features: (1) OpenForecast defines three open-ended event forecasting tasks, enabling unforeseen, multifaceted, and multi-step forecasting. (2) OpenForecast collects and annotates a large-scale dataset from Wikipedia and news, including 43,419 complex events spanning from 1950 to 2024. Particularly, this annotation can be completed automatically without any manual annotation cost. Meanwhile, we introduce an automatic LLM-based Retrieval-Augmented Evaluation method (LRAE) for complex events, enabling OpenForecast to evaluate the ability of complex event forecasting of large language models. Finally, we conduct comprehensive human evaluations to verify the quality and challenges of OpenForecast, and the consistency between LEAE metric and human evaluation. OpenForecast and related codes will be publicly released.

pdf bib abs

Although large language models have significantly advanced natural language generation, their potential in low-resource machine translation has not yet been fully explored, especially for languages that translation models have not been trained on. In this study, we provide a detailed demonstration of how to efficiently expand low-resource languages for large language models and significantly enhance the model’s translation ability, using Uyghur as an example. The process involves four stages: collecting and pre-processing monolingual data, conducting continuous pre-training with extensive monolingual data, fine-tuning with less parallel corpora using translation supervision, and proposing a direct preference optimization based on translation self-evolution (DPOSE) on this basis. Extensive experiments have shown that our strategy effectively expands the low-resource languages supported by large language models and significantly enhances the model’s translation ability in Uyghur with less parallel data. Our research provides detailed insights for expanding other low-resource languages into large language models.

pdf bib abs

Event forecasting requires modeling historical event data to predict future events, and achieving accurate predictions depends on effectively capturing the relevant historical information that aids forecasting. Most existing methods focus on entities and structural dependencies to capture historical clues but often overlook implicitly relevant information. This limitation arises from overlooking event semantics and deeper factual associations that are not explicitly connected in the graph structure but are nonetheless critical for accurate forecasting. To address this, we propose a dual-criteria constraint strategy that leverages event semantics for relevance modeling and incorporates a self-supervised semantic filter based on factual event associations to capture implicitly relevant historical information. Building on this strategy, our method, termed ITHI (Integrating Three types of Historical Information), combines sequential event information, periodically repeated event information, and relevant historical information to achieve context-aware event forecasting. We evaluated the proposed ITHI method on three public benchmark datasets, achieving state-of-the-art performance and significantly outperforming existing approaches. Additionally, we validated its effectiveness on two structured temporal knowledge graph forecasting dataset.

pdf bib abs

Large Language Models (LLMs) exhibit strong reasoning capabilities and are widely applied in event forecasting. However, studies have demonstrated that LLMs exhibit human-like cognitive biases, systematic patterns of deviation from rationality in decision-making. To explore the cognitive biases in event forecasting, we introduce CogForecast, a human-curated dataset comprising six topics. Experimental results on three LLMs reveal significant cognitive biases in LLM-based event forecasting methods. To address this issue, we propose MCA, a Multi-Cognition Agentic framework. Specifically, MCA leverages LLMs to act as multi-cognition event participants, performing perspective-taking based on the cognitive patterns of event participants to alleviate the inherent cognitive biases in LLMs and offer diverse analytical perspectives. Then, MCA clusters agents according to their predictions and derives a final answer through a group-level reliability scoring method. Experimental results on a dataset including eight event categories demonstrate the effectiveness of MCA. Using Llama-3.1-70B, MCA achieves an accuracy of 82.3% (79.5% for the human crowd). Additionally, we demonstrate that MCA can alleviate the cognitive biases in LLMs and investigate three influencing factors.

2022

pdf bib abs

Prompt-based learning, which exploits knowledge from pre-trained language models by providing textual prompts and designing appropriate answer-category mapping methods, has achieved impressive successes on few-shot text classification and natural language inference (NLI). Because of the diverse linguistic expression, there exist many answer tokens for the same category. However, both manual answer design and automatic answer search constrain answer space and therefore hardly achieve ideal performance. To address this issue, we propose an answer space clustered prompting model (ASCM) together with a synonym initialization method (SI) which automatically categorizes all answer tokens in a semantic-clustered embedding space. We also propose a stable semi-supervised method named stair learning (SL) that orderly distills knowledge from better models to weaker models. Extensive experiments demonstrate that our ASCM+SL significantly outperforms existing state-of-the-art techniques in few-shot settings.

2021

pdf bib abs

基于时间注意力胶囊网络的维吾尔语情感分类模型(Uyghur Sentiment Classification Model Based on Temporal Attention Capsule Networks)
Hantian Luo (罗涵天) | Yating Yang (杨雅婷) | Rui Dong (董瑞) | Bo Ma (马博)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

维吾尔语属于稀缺资源语言,如何在资源有限的情况下提升维吾尔语情感分类模型的性能,是目前待解决的问题。本文针对现有维吾尔语情感分析因为泛化能力不足所导致的分类效果不佳的问题,提出了基于时间卷积注意力胶囊网络的维吾尔语情感分类模型匨協十匭千卡印匩。本文在维吾尔语情感分类数据集中进行了实验并且从多个评价指标(准确率,精确率,召回率,F1值)进行评估,实验结果表明本文提出的模型相比传统深度学习模型可以有效提升维吾尔语情感分类的各项指标。

2020

pdf bib abs

Multi-Task Neural Model for Agglutinative Language Translation
Yirong Pan | Xiao Li | Yating Yang | Rui Dong
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Neural machine translation (NMT) has achieved impressive performance recently by using large-scale parallel corpora. However, it struggles in the low-resource and morphologically-rich scenarios of agglutinative language translation task. Inspired by the finding that monolingual data can greatly improve the NMT performance, we propose a multi-task neural model that jointly learns to perform bi-directional translation and agglutinative language stemming. Our approach employs the shared encoder and decoder to train a single model without changing the standard NMT architecture but instead adding a token before each source-side sentence to specify the desired target outputs of the two different tasks. Experimental results on Turkish-English and Uyghur-Chinese show that our proposed approach can significantly improve the translation performance on agglutinative languages by using a small amount of monolingual data.

2018

pdf bib abs

Toward Better Loanword Identification in Uyghur Using Cross-lingual Word Embeddings
Chenggang Mi | Yating Yang | Lei Wang | Xi Zhou | Tonghai Jiang
Proceedings of the 27th International Conference on Computational Linguistics

To enrich vocabulary of low resource settings, we proposed a novel method which identify loanwords in monolingual corpora. More specifically, we first use cross-lingual word embeddings as the core feature to generate semantically related candidates based on comparable corpora and a small bilingual lexicon; then, a log-linear model which combines several shallow features such as pronunciation similarity and hybrid language model features to predict the final results. In this paper, we use Uyghur as the receipt language and try to detect loanwords in four donor languages: Arabic, Chinese, Persian and Russian. We conduct two groups of experiments to evaluate the effectiveness of our proposed approach: loanword identification and OOV translation in four language pairs and eight translation directions (Uyghur-Arabic, Arabic-Uyghur, Uyghur-Chinese, Chinese-Uyghur, Uyghur-Persian, Persian-Uyghur, Uyghur-Russian, and Russian-Uyghur). Experimental results on loanword identification show that our method outperforms other baseline models significantly. Neural machine translation models integrating results of loanword identification experiments achieve the best results on OOV translation(with 0.5-0.9 BLEU improvements)

pdf bib

A Neural Network Based Model for Loanword Identification in Uyghur
Chenggang Mi | Yating Yang | Lei Wang | Xi Zhou | Tonghai Jiang
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib abs

To alleviate data sparsity in spoken Uyghur machine translation, we proposed a log-linear based morphological segmentation approach. Instead of learning model only from monolingual annotated corpus, this approach optimizes Uyghur segmentation for spoken translation based on both bilingual and monolingual corpus. Our approach relies on several features such as traditional conditional random field (CRF) feature, bilingual word alignment feature and monolingual suffixword co-occurrence feature. Experimental results shown that our proposed segmentation model for Uyghur spoken translation achieved 1.6 BLEU score improvements compared with the state-of-the-art baseline.

2016

pdf bib abs

A Bilingual Discourse Corpus and Its Applications
Yang Liu | Jiajun Zhang | Chengqing Zong | Yating Yang | Xi Zhou
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Existing discourse research only focuses on the monolingual languages and the inconsistency between languages limits the power of the discourse theory in multilingual applications such as machine translation. To address this issue, we design and build a bilingual discource corpus in which we are currently defining and annotating the bilingual elementary discourse units (BEDUs). The BEDUs are then organized into hierarchical structures. Using this discourse style, we have annotated nearly 20K LDC sentences. Finally, we design a bilingual discourse based method for machine translation evaluation and show the effectiveness of our bilingual discourse annotations.

pdf bib

Yating Yang

2025

2022

2021

2020

2018

2017

2016

2013

Co-authors

Venues