Peng Wang

Southeast University, Nanjing

Other people with similar names: Peng Wang (may refer to several people), Peng Wang (Macau University, Central South University), Peng Wang (Zhejiang University), Peng Wang (Chinese Academy of Sciences), Peng Wang (Fudan University), Peng Wang (University of Virginia)


2025

Acquisition and Application of Novel Knowledge in Large Language Models
Ziyu Shang | Jianghan Liu | Zhizhao Luo | Peng Wang | Wenjun Ke | Jiajun Liu | Zijie Xu | Guozheng Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advancements in large language models (LLMs) have demonstrated impressive generative capabilities, primarily due to their extensive parameterization, which enables them to encode vast knowledge. However, effectively integrating new knowledge into LLMs remains a major challenge. Current research typically first constructs a novel-knowledge dataset and then injects that knowledge into LLMs through various techniques. Existing methods for constructing such datasets, however, either rely on timestamps, which lack rigor, or synthesize data from simple templates, which fail to reflect real-world complexity. To address this issue, we propose a novel knowledge dataset construction approach that simulates biological evolution, using knowledge graphs to generate synthetic entities with diverse attributes, resulting in a dataset, NovelHuman. Systematic analysis on NovelHuman reveals that the intra-sentence position of knowledge significantly affects its acquisition. Therefore, we introduce an intra-sentence permutation to enhance knowledge acquisition. Furthermore, given the potential conflicts between autoregressive (AR) training objectives and permutation-based learning, we propose PermAR, a permutation-based language modeling framework for AR models. PermAR integrates seamlessly with mainstream AR architectures, endowing them with bidirectional knowledge acquisition capabilities. Extensive experiments demonstrate the superiority of PermAR, which outperforms knowledge augmentation methods by 3.3%-38%.
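The abstract describes the mechanism only at a high level. As a rough data-level illustration, intra-sentence permutation (exposing each fact at varied positions within a sentence) might look like the sketch below; the sentence, its segment boundaries, and the synthetic entity are hypothetical, and PermAR's actual permutation-aware AR objective is not reproduced here.

```python
import itertools
import random

def permute_segments(segments, k=3, seed=0):
    """Return up to k orderings of a sentence's knowledge segments,
    so each fact is seen at varied intra-sentence positions."""
    rng = random.Random(seed)
    perms = list(itertools.permutations(segments))
    rng.shuffle(perms)
    return [" ".join(p) for p in perms[:k]]

# Hypothetical synthetic-entity sentence split into attribute segments.
segments = ["Arel Vance was born in 1987,",
            "works as a glaciologist,",
            "and lives in Tromso."]
for variant in permute_segments(segments):
    print(variant)
```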

LLM-Guided Semantic-Aware Clustering for Topic Modeling
Jianghan Liu | Ziyu Shang | Wenjun Ke | Peng Wang | Zhizhao Luo | Jiajun Liu | Guozheng Li | Yining Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Topic modeling aims to discover the distribution of topics within a corpus. The advanced comprehension and generative capabilities of large language models (LLMs) have opened new avenues for topic modeling, particularly by prompting LLMs to generate topics and then refining them by merging similar ones. However, this approach requires LLMs to generate topics at a consistent granularity, and thus relies on the exceptional instruction-following capabilities of closed-source LLMs (such as GPT-4) or requires additional training. Moreover, merging based only on topic words, while neglecting the fine-grained semantics within documents, may fail to fully uncover the underlying topic structure. In this work, we propose a semi-supervised topic modeling method, LiSA, that combines LLMs with clustering to improve topic generation and distribution. Specifically, we begin by prompting LLMs to generate a candidate topic word for each document, thereby constructing a topic-level semantic space. To further exploit the mutual complementarity between the document and topic spaces, we first cluster documents and candidate topic words, and then establish a mapping from documents to topics in an LLM-guided assignment stage. Subsequently, we introduce a collaborative enhancement strategy to align the two semantic spaces and establish a better topic distribution. Experimental results demonstrate that LiSA outperforms state-of-the-art methods that utilize GPT-4 on topic alignment, and exhibits competitive performance compared to neural topic models on topic quality. The code is available at https://github.com/ljh986/LiSA.
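As a minimal sketch of the cluster-then-assign idea, the snippet below jointly embeds documents and their candidate topic words and reads a document-to-topic mapping off shared clusters. TF-IDF vectors stand in for the LLM-derived semantics, the documents and topic words are invented, and LiSA's prompting and collaborative-enhancement stages are not reproduced.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents and one LLM-generated candidate topic word each.
docs = ["the football match ended in a late goal",
        "central banks raised interest rates again",
        "the striker scored twice in the football derby",
        "rates and inflation pressure the economy"]
candidate_topics = ["football", "rates", "football", "economy"]

# Embed documents and candidate topic words in a shared space
# (TF-IDF stands in for the semantic representations an LLM would provide).
vec = TfidfVectorizer().fit(docs + candidate_topics)
doc_emb, topic_emb = vec.transform(docs), vec.transform(candidate_topics)

# Cluster documents, then assign each cluster the topic words that land in it.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(doc_emb)
topic_labels = km.predict(topic_emb)
for doc, label in zip(docs, km.labels_):
    words = {t for t, l in zip(candidate_topics, topic_labels) if l == label}
    print(f"cluster {label}: {doc!r} -> {words}")
```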

On the Consistency of Commonsense in Large Language Models
Guozheng Li | Peng Wang | Wenjun Ke | Zijie Xu | Jiajun Liu | Ziyu Shang
Findings of the Association for Computational Linguistics: ACL 2025

Commonsense, humans’ implicit understanding of everyday situations, is crucial for large language models (LLMs). Existing commonsense evaluations for LLMs primarily focus on downstream knowledge tasks, failing to probe whether LLMs truly understand and utilize knowledge or merely memorize it. They also rely heavily on human annotation and lack automated large-scale data generation. To address this, we automatically construct a large benchmark named CoCo (Consistency of Commonsense), comprising 39K samples derived from commonsense knowledge graphs (CSKGs) and paired with symbolic questions and ground-truth answers. CoCo systematically assesses LLMs’ knowledge memorization, comprehension, and application, and examines the consistency between these tasks. To enhance our evaluation, we also propose novel metrics and prompting strategies. Experimental results on multiple LLMs reveal that CoCo presents significant challenges, and our detailed analysis provides deeper insights into the strengths and limitations of LLMs’ commonsense abilities.
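To make the three probed abilities concrete, the hypothetical templating sketch below shows how a single CSKG triple could yield memorization, comprehension, and application questions; CoCo's actual templates, wording, and metrics may differ.

```python
# One CSKG triple rendered as three task types; all wording here is illustrative.
triple = ("sun", "CapableOf", "melt ice")

def memorization(h, r, t):
    return f"Complete the fact: the {h} is capable of ___.", t

def comprehension(h, r, t):
    return f"True or false: the {h} is capable of {t}.", "true"

def application(h, r, t):
    return ("An ice sculpture is left outdoors on a sunny afternoon. "
            "What happens to it?", f"it melts, because the {h} can {t}")

for task in (memorization, comprehension, application):
    question, answer = task(*triple)
    print(f"{task.__name__}: {question} -> {answer}")
```

Consistency can then be checked by asking whether a model that answers the memorization form correctly also answers the comprehension and application forms correctly.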

2024

Boosting Textural NER with Synthetic Image and Instructive Alignment
Jiahao Wang | Wenjun Ke | Peng Wang | Hang Zhang | Dong Nie | Jiajun Liu | Guozheng Li | Ziyu Shang
Findings of the Association for Computational Linguistics: ACL 2024

Named entity recognition (NER) is a pivotal task that relies on textual data, where the absence of context often impedes entity disambiguation. To tackle this challenge, conventional methods incorporate images crawled from the internet as auxiliary information. However, such images often lack the relevant entities or introduce noise. Even with high-quality images, it remains challenging to use them efficiently as auxiliaries (i.e., to align them with text at a fine granularity). We introduce a novel method named InstructNER to address these issues. Leveraging the rich real-world knowledge and image synthesis capabilities of a large pre-trained stable diffusion (SD) model, InstructNER transforms text-only NER into a multimodal NER (MNER) task. A selection process automatically identifies the best synthetic image by comparing fine-grained similarities with internet-crawled images through a visual bag-of-words strategy. Notably, a cross-attention matrix between the synthetic image and the raw text emerges during image synthesis, which inspires a soft attention guidance alignment (AGA) mechanism. AGA optimizes the MNER task and concurrently facilitates instructive alignment. Empirical experiments on prominent MNER datasets show that our method surpasses all text-only baselines, improving the F1-score by 1.4% to 2.3%. Remarkably, our approach remains competitive even when compared to fully multimodal baselines. Furthermore, we open-source a comprehensive synthetic image dataset and the code to supplement the existing raw datasets. The code and datasets are available at https://github.com/Heyest/InstructNER.
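The visual bag-of-words selection step can be pictured as follows: local descriptors are quantized against a codebook, and a synthetic image is scored against a crawled one by comparing codeword histograms. In the sketch below, random arrays stand in for real descriptors (e.g., SIFT features) and the codebook size is arbitrary; this is not InstructNER's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Learn a 16-word visual codebook from a pool of local descriptors
# (random placeholders here; in practice, e.g., SIFT descriptors).
codebook = KMeans(n_clusters=16, n_init=10, random_state=0).fit(rng.normal(size=(500, 64)))

def bovw_histogram(descriptors):
    """Quantize descriptors to codewords and return a normalized histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=16).astype(float)
    return hist / hist.sum()

def similarity(desc_a, desc_b):
    """Cosine similarity between two images' bag-of-visual-words histograms."""
    a, b = bovw_histogram(desc_a), bovw_histogram(desc_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

synthetic = rng.normal(size=(200, 64))  # descriptors of a synthetic image
crawled = rng.normal(size=(180, 64))    # descriptors of a crawled image
print(f"BoVW similarity: {similarity(synthetic, crawled):.3f}")
```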

Unlocking Instructive In-Context Learning with Tabular Prompting for Relational Triple Extraction
Guozheng Li | Wenjun Ke | Peng Wang | Zijie Xu | Ke Ji | Jiajun Liu | Ziyu Shang | Qiqing Luo
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In-context learning (ICL) for relational triple extraction (RTE) has achieved promising performance, but still encounters two key challenges: (1) how to design effective prompts and (2) how to select proper demonstrations. Existing methods fail to address these challenges appropriately. On the one hand, they usually recast the RTE task into text-to-text prompting formats, which is unnatural and causes a mismatch between the output format at pre-training time and at inference time for large language models (LLMs). On the other hand, they utilize only surface natural language features and neglect triple semantics when selecting samples. These issues block further improvement in ICL for RTE, so we aim to tackle the prompt design and sample selection challenges simultaneously. To this end, we devise a tabular prompting method for RTE (TableIE), which frames RTE as a table generation task to incorporate explicit structured information into ICL, facilitating the conversion of outputs into RTE structures. We then propose instructive in-context learning (I2CL), which selects and annotates only a few samples from massive unlabeled data based on their internal triple semantics. Specifically, we first adopt off-the-shelf LLMs to perform schema-agnostic pre-extraction of triples from unlabeled samples using TableIE. We then propose a novel triple-level similarity metric that considers the triple semantics shared between these samples, and train a sample retrieval model on the similarities calculated over the pre-extracted unlabeled data. We also devise three different sample annotation strategies for various scenarios. Finally, the annotated samples serve as few-shot demonstrations in ICL for RTE. Experimental results on two RTE benchmarks show that I2CL with TableIE achieves state-of-the-art performance compared to other methods under various few-shot RTE settings.
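As a rough illustration of tabular prompting, the sketch below shows a table-formatted prompt whose completion parses directly into triples. The prompt wording and the LLM completion are invented for illustration, not TableIE's exact format.

```python
# A table-formatted RTE prompt; the header fixes the output structure.
prompt = (
    "Extract relational triples from the sentence as a table with the header:\n"
    "head | relation | tail\n"
    "Sentence: Marie Curie was born in Warsaw and worked in Paris.\n"
    "Table:"
)

# Hypothetical LLM completion for the prompt above.
completion = """head | relation | tail
Marie Curie | born_in | Warsaw
Marie Curie | worked_in | Paris"""

def parse_table(text):
    """Convert the generated table (minus the header) into (h, r, t) tuples."""
    rows = (line.split("|") for line in text.splitlines()[1:])
    return [tuple(cell.strip() for cell in row) for row in rows if len(row) == 3]

print(parse_table(completion))
```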

2023

Revisiting Large Language Models as Zero-shot Relation Extractors
Guozheng Li | Peng Wang | Wenjun Ke
Findings of the Association for Computational Linguistics: EMNLP 2023

Relation extraction (RE) consistently involves a certain degree of labeled or unlabeled data, even under the zero-shot setting. Recent studies have shown that large language models (LLMs) transfer well to new tasks out-of-the-box given only a natural language prompt, which opens the possibility of extracting relations from text without any data or parameter tuning. This work explores LLMs, such as ChatGPT, as zero-shot relation extractors. On the one hand, we analyze the drawbacks of existing RE prompts and attempt to incorporate recent prompting techniques, such as chain-of-thought (CoT), to improve zero-shot RE. We propose summarize-and-ask (SumAsk) prompting, a simple prompt that recursively uses LLMs to transform RE inputs into an effective question answering (QA) format. On the other hand, we conduct comprehensive experiments on various benchmarks and settings to investigate the capabilities of LLMs on zero-shot RE. Specifically, we have the following findings: (i) SumAsk consistently and significantly improves LLM performance across different model sizes, benchmarks, and settings; (ii) zero-shot prompting with ChatGPT achieves competitive or superior results compared with zero-shot and fully supervised methods; (iii) LLMs deliver promising performance in extracting overlapping relations; (iv) performance varies greatly across relations. Unlike small language models, LLMs are effective in handling the challenging none-of-the-above (NoTA) relation.
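A schematic of the summarize-and-ask loop, with a stubbed llm function standing in for a ChatGPT-style API (its canned replies exist only so the sketch runs); the prompts and recursion are simplified relative to the paper's SumAsk.

```python
def llm(prompt):
    """Stand-in for an LLM API call; returns canned answers for this sketch."""
    if prompt.startswith("Summarize"):
        return "Jobs founded Apple."
    claim = prompt.split("Is it true that ")[-1].split("?")[0]
    return "yes" if claim in "Jobs founded Apple in 1976." else "no"

def sum_ask(sentence, head, tail, candidate_relations):
    # Step 1: summarize the input with respect to the entity pair.
    summary = llm(f"Summarize the relation between '{head}' and '{tail}' in: {sentence}")
    # Step 2: verify each candidate relation as a yes/no question over the summary.
    for rel in candidate_relations:
        answer = llm(f"Given: {summary}\nIs it true that {head} {rel} {tail}? Answer yes or no.")
        if answer.startswith("yes"):
            return rel
    return "none-of-the-above"  # NoTA: no candidate relation was verified

print(sum_ask("Jobs founded Apple in 1976.", "Jobs", "Apple", ["acquired", "founded"]))
```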

2021

Hyperbolic Hierarchy-Aware Knowledge Graph Embedding for Link Prediction
Zhe Pan | Peng Wang
Findings of the Association for Computational Linguistics: EMNLP 2021

Knowledge graph embedding (KGE), which uses low-dimensional representations to predict missing information, is widely applied in knowledge completion. Existing embedding methods are mostly built on Euclidean space and struggle to handle hierarchical structures. Hyperbolic embedding methods have shown promise for high-fidelity and concise representation of hierarchical data. However, these methods do not adequately capture the logical patterns in knowledge graphs. To address this problem, we propose a novel KGE model with an extended Poincaré ball and a polar coordinate system to capture hierarchical structures. We use the tangent space and the exponential map to initialize the corresponding vectors and map them onto the Poincaré ball in hyperbolic space. To handle the boundary conditions, the boundary is stretched and zoomed by expanding the modulus length in the Poincaré ball. We optimize our model using polar coordinates and modified operators in the extended Poincaré ball. Experiments achieve new state-of-the-art results on several link prediction tasks, which demonstrates the effectiveness of our method.
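Two hyperbolic primitives behind this construction can be stated compactly: the exponential map at the origin, exp_0(v) = tanh(||v||) * v/||v||, and the unit-curvature Poincaré distance d(x, y) = arcosh(1 + 2||x - y||^2 / ((1 - ||x||^2)(1 - ||y||^2))). The sketch below implements just these two; the paper's extended ball, boundary stretching, and polar-coordinate optimization are omitted.

```python
import numpy as np

def exp_map_zero(v, eps=1e-9):
    """Exponential map at the origin: tangent vector -> point in the unit ball."""
    norm = np.linalg.norm(v) + eps
    return np.tanh(norm) * v / norm

def poincare_distance(x, y, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq = np.sum((x - y) ** 2)
    denom = (1 - np.sum(x ** 2)) * (1 - np.sum(y ** 2)) + eps
    return float(np.arccosh(1 + 2 * sq / denom))

# Points near the origin act like general concepts; points pushed toward the
# boundary act like specific entities, so hierarchies embed with low distortion.
parent = exp_map_zero(np.array([0.1, 0.0]))
child = exp_map_zero(np.array([2.0, 0.5]))
print(f"d(parent, child) = {poincare_distance(parent, child):.3f}")
```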

2020

AprilE: Attention with Pseudo Residual Connection for Knowledge Graph Embedding
Yuzhang Liu | Peng Wang | Yingtai Li | Yizhan Shao | Zhongkai Xu
Proceedings of the 28th International Conference on Computational Linguistics

Knowledge graph embedding maps entities and relations into a low-dimensional vector space. However, it remains challenging for many existing methods to model diverse relational patterns, especially symmetric and antisymmetric relations. To address this issue, we propose a novel model, AprilE, which employs triple-level self-attention and a pseudo residual connection to model relational patterns. The triple-level self-attention treats the head entity, relation, and tail entity as a sequence and captures the dependencies within a triple. At the same time, the pseudo residual connection retains primitive semantic features. Furthermore, to deal with symmetric and antisymmetric relations, two score function schemas are designed via a position-adaptive mechanism. Experimental results on public datasets demonstrate that our model produces expressive knowledge embeddings and significantly outperforms most state-of-the-art works.
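To make the architecture concrete, below is a minimal sketch of triple-level self-attention with a residual connection, using PyTorch's stock multi-head attention and a TransE-style distance as an illustrative stand-in for AprilE's position-adaptive score functions; the embedding dimension and head count are arbitrary.

```python
import torch
import torch.nn as nn

dim = 32
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

# Head, relation, and tail embeddings treated as a length-3 sequence.
h, r, t = (torch.randn(1, 1, dim) for _ in range(3))
triple = torch.cat([h, r, t], dim=1)           # shape: (batch=1, seq=3, dim)

out, _ = attn(triple, triple, triple)          # dependencies within the triple
out = out + triple                             # residual keeps primitive semantics

# Score the attended triple (TransE-style stand-in, not AprilE's exact score).
h_a, r_a, t_a = out.unbind(dim=1)
score = -torch.norm(h_a + r_a - t_a, dim=-1)
print(score.item())
```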