The era of Large Language Models (LLMs) raises new demands for automatic evaluation metrics, which should be adaptable to various application scenarios while maintaining low cost and effectiveness. Traditional metrics for automatic text evaluation are often tailored to specific scenarios, while LLM-based evaluation metrics are costly, requiring fine-tuning or rely heavily on the generation capabilities of LLMs. Besides, previous LLM-based metrics ignore the fact that, within the space of LLM representations, there exist direction vectors that indicate the estimation of text quality. To this end, we introduce RepEval, a metric that leverages the projection of LLM representations for evaluation. Through simple prompt modifications, RepEval can easily transition to various tasks, requiring only minimal sample pairs for direction vector construction. Results on fourteen datasets across two evaluation tasks demonstrate the high effectiveness of our method, which exhibits a higher correlation with human judgments than previous methods, even in complex evaluation scenarios involving pair-wise selection under nuanced aspects. Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
The majority of automatic metrics for evaluating NLG systems are reference-based. However, the challenge of collecting human annotation results in a lack of reliable references in numerous application scenarios. Despite recent advancements in reference-free metrics, it has not been well understood when and where they can be used as an alternative to reference-based metrics. In this study, by employing diverse analytical approaches, we comprehensively assess the performance of both metrics across a wide range of NLG tasks, encompassing eight datasets and eight evaluation models. Based on solid experiments, the results show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality. However, their effectiveness varies across tasks and is influenced by the quality of candidate texts. Therefore, it’s important to assess the performance of reference-free metrics before applying them to a new task, especially when inputs are in uncommon form or when the answer space is highly variable. Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
Graph-to-text (G2T) generation and text-to-graph (T2G) triple extraction are two essential tasks for knowledge graphs. Existing unsupervised approaches become suitable candidates for jointly learning the two tasks due to their avoidance of using graph-text parallel data. However, they adopt multiple complex modules and still require entity information or relation type for training. To this end, we propose INFINITY, a simple yet effective unsupervised method with a unified pretrained language model that does not introduce external annotation tools or additional parallel information. It achieves fully unsupervised graph-text mutual conversion for the first time. Specifically, INFINITY treats both G2T and T2G as a bidirectional sequence generation task by fine-tuning only one pretrained seq2seq model. A novel back-translation-based framework is then designed to generate synthetic parallel data automatically. Besides, we investigate the impact of graph linearization and introduce the structure-aware fine-tuning strategy to alleviate possible performance deterioration via retaining structural information in graph sequences. As a fully unsupervised framework, INFINITY is empirically verified to outperform state-of-the-art baselines for G2T and T2G tasks. Additionally, we also devise a new training setting called cross learning for low-resource unsupervised information extraction.
Researchers usually come up with new ideas only after thoroughly comprehending vast quantities of literature. The difficulty of this procedure is exacerbated by the fact that the number of academic publications is growing exponentially. In this study, we devise a framework based on concept co-occurrence for academic idea inspiration, which has been integrated into a research assistant system. From our perspective, the emergence of a new idea can be regarded as the fusion of two concepts that co-occur in an academic paper. We construct evolving concept graphs according to the co-occurrence relationship of concepts from 20 disciplines or topics. Then we design a temporal link prediction method based on masked language model to explore potential connections between different concepts. To verbalize the newly discovered connections, we also utilize the pretrained language model to generate a description of an idea based on a new data structure called co-occurrence citation quintuple. We evaluate our proposed system using both automatic metrics and human assessment. The results demonstrate that our system has broad prospects and can assist researchers in expediting the process of discovering new ideas.
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields. However, LLMs are prone to hallucinate untruthful or nonsensical outputs that fail to meet user expectations in many real-world applications. Existing works for detecting hallucinations in LLMs either rely on external knowledge for reference retrieval or require sampling multiple responses from the LLM for consistency verification, making these methods costly and inefficient. In this paper, we propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs. Our approach imitates human focus in factuality checking from three aspects: 1) focus on the most informative and important keywords in the given text; 2) focus on the unreliable tokens in historical context which may lead to a cascade of hallucinations; and 3) focus on the token properties such as token type and token frequency. Experimental results on relevant datasets demonstrate the effectiveness of our proposed method, which achieves state-of-the-art performance across all the evaluation metrics and eliminates the need for additional information.
Joint relational triple extraction from unstructured text is an important task in information extraction. However, most existing works either ignore the semantic information of relations or predict subjects and objects sequentially. To address the issues, we introduce a new blank filling paradigm for the task, and propose a relation-first blank filling network (RFBFN). Specifically, we first detect potential relations maintained in the text to aid the following entity pair extraction. Then, we transform relations into relation templates with blanks which contain the fine-grained semantic representation of the relations. Finally, corresponding subjects and objects are extracted simultaneously by filling the blanks. We evaluate the proposed model on public benchmark datasets. Experimental results show our model outperforms current state-of-the-art methods. The source code of our work is available at: https://github.com/lizhe2016/RFBFN.
Relational structures such as schema linking and schema encoding have been validated as a key component to qualitatively translating natural language into SQL queries. However, introducing these structural relations comes with prices: they often result in a specialized model structure, which largely prohibits using large pretrained models in text-to-SQL. To address this problem, we propose RASAT: a Transformer seq2seq architecture augmented with relation-aware self-attention that could leverage a variety of relational structures while inheriting the pretrained parameters from the T5 model effectively. Our model can incorporate almost all types of existing relations in the literature, and in addition, we propose introducing co-reference relations for the multi-turn scenario. Experimental results on three widely used text-to-SQL datasets, covering both single-turn and multi-turn scenarios, have shown that RASAT could achieve competitive results in all three benchmarks, achieving state-of-the-art execution accuracy (75.5% EX on Spider, 52.6% IEX on SParC, and 37.4% IEX on CoSQL).