An Investigation of LLMs’ Inefficacy in Understanding Converse Relations

Large Language Models (LLMs) have achieved remarkable success in many formal language oriented tasks, such as structural data-to-text and semantic parsing. However, current benchmarks mostly follow the data distribution of the pre-training data of LLMs. Therefore, a natural question arises: do LLMs really understand the structured semantics of formal languages? In this paper, we investigate this problem on a special case, converse binary relations. We introduce a new benchmark, ConvRe, focusing on converse relations, which contains 17 relations and 1240 triples extracted from popular knowledge graph completion datasets. ConvRe features two tasks, Re2Text and Text2Re, which are formulated as multiple-choice question answering to evaluate LLMs' ability to determine the match between relations and associated text. For the evaluation protocol, apart from different prompting methods, we further introduce variants to the test text and few-shot example text. We conduct experiments on three popular LLM families and observe various scaling trends. The results suggest that LLMs often resort to shortcut learning and still face challenges on our proposed benchmark.


Introduction
Large Language Models (LLMs) have demonstrated impressive empirical results on various NLP tasks (Bubeck et al., 2023; OpenAI, 2023; Anthropic, 2023), including formal language-oriented tasks such as structural data-to-text (Xiang et al., 2022) and semantic parsing (Chen et al., 2021; Li et al., 2023a), which require sophisticated comprehension and production of structured language content. Despite these promising advances, a critical concern remains largely unexplored: do these LLMs genuinely understand the nuanced semantics of formal languages, or are they merely exploiting statistical patterns inherent in their pre-training data? If such shortcuts exist, it implies that LLMs may struggle to generalize to novel and unique formal language definitions, potentially hindering the robustness and scalability of practical applications.

♣ Equal contribution.

In this work, we delve into this question by focusing on a specific aspect of formal language understanding: the comprehension of converse relations. As shown in Figure 1, a converse relation redefines the semantic relation between entities while keeping the surface form of the triple unchanged. For instance, the triple (x, has part, y) should be interpreted as "x has a part called y" under the normal relation (Codd, 1983), but as "y has a part called x" in the converse form. Notably, LLMs are largely unfamiliar with converse relations, as the data they see in pre-training mostly comprises normal relations. It is imperative for LLMs to accurately understand and utilize these converse relations, i.e., to truly follow instructions rather than recall memorized patterns (shortcuts) about normal relations, as this significantly impacts the semantic coherence of their output.
To systematically evaluate the competence of LLMs in recognizing and processing converse relations, we introduce a novel benchmark, ConvRe. This benchmark draws upon 17 diverse relations and 1240 triples derived from prominent knowledge graph completion datasets. ConvRe introduces two primary tasks, Re2Text and Text2Re, formatted as multiple-choice question answering tests. These tasks challenge LLMs to correctly match relations (Re) with their corresponding natural language text (Text).
Table 1: The definition of normal and converse relations. Examples are provided below the notations. A triple can be defined to represent either the normal relation R or the converse relation R⊤. Each relation is associated with a pairing natural language text, which can further be paraphrased.
  Normal R:    (x, has part, y)  s: "x has a part called y."   s′: "x possesses a specific component named y."
  Converse R⊤: (x, has part, y)  s⊤: "y has a part called x."  s⊤′: "y contains x."

During empirical evaluation, we add various prompting methods and introduce variants to the text. More specifically, we manually craft examples of different types for few-shot prompting, creating a more challenging testbed for these models. Our findings, based on thorough experiments
using three popular LLM families, reveal interesting scaling trends and suggest that performance on understanding formal languages might be inflated by shortcut learning. This exploration contributes to the growing body of literature that seeks to assess the true capabilities of LLMs, and the extent to which they genuinely comprehend the semantics of formal languages.

ConvRe Benchmark
In this section, we introduce the motivation, task formulation, and design choices of our ConvRe benchmark, as well as the details of data collection.

Motivation
The recent surge in the performance of LLMs in understanding formal language, including tasks such as semantic parsing or data2text, can potentially be misleading. Traditional evaluation benchmarks used in such tasks often reflect statistical patterns similar to those found in the pre-training data of LLMs. We posit that this could lead LLMs to take shortcuts as described in Geirhos et al. (2020), thereby inflating their apparent understanding of formal language semantics. Instead of comprehensively grasping the semantics, the LLMs might simply be learning the statistical tendencies present in their training data. To this end, we propose a new benchmark that uses normal and converse relations to examine the true semantic comprehension capabilities of LLMs.

Normal and Converse Relation
Normal Relation Formally, a binary relation R over sets X and Y is a set of ordered pairs (x, y) consisting of elements x ∈ X and y ∈ Y (Codd, 1983). Usually, a normal relation R is represented as R = {(x, R, y) ⇒ xRy}, where R is the specific relation phrase. Normal relations usually appear in knowledge graphs, along with a pair of subject x and object y. This triple can be mapped to a semantically equivalent natural language text s. Examples can be found in Table 1.
Converse Relation In addition to the normal relation, we also introduce a converse relation that utilizes the same triple format (x, R, y) to denote the converse mapping R⊤. It defines a new form by swapping the pairing order, which can be expressed as R⊤ = {(x, R, y) ⇒ yRx}. Accordingly, under the converse mapping, the triple (x, R, y) corresponds to the converse natural language text s⊤. Examples are provided in Table 1 for further clarity.
It is worth noting that both the normal and converse relation definitions used in our evaluation have a localized scope to minimize ambiguity. This helps us ascertain whether LLMs can understand the semantics of the custom relation definition rather than resorting to shortcut learning.
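The normal and converse mappings above can be sketched in a few lines of code (a minimal illustration following the triple format and example relation of Table 1; the function name is our own, not part of the benchmark):

```python
def triple_to_text(triple, converse=False):
    """Render a triple (x, R, y) as natural language text.

    Under the normal mapping R, (x, R, y) reads "x R y"; under the
    converse mapping R^T, the same surface triple reads "y R x".
    """
    x, relation, y = triple
    # Swapping the pairing order is the only difference between R and R^T.
    subj, obj = (y, x) if converse else (x, y)
    return f"{subj} {relation} {obj}"

triple = ("sword", "has a part called", "hilt")
print(triple_to_text(triple))                 # sword has a part called hilt
print(triple_to_text(triple, converse=True))  # hilt has a part called sword
```

The same surface triple thus yields opposite readings, which is exactly what makes converse relations a probe of instruction following rather than pattern recall.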

Task Formulation
We designed two tasks to assess LLMs' understanding of normal and converse relations. Both tasks focus on semantic equivalence translation between relations (Re) and natural language text (Text).

Re2Text
In this task, given the specification of a normal/converse relation and its associated natural language text along with a query triple, the model is asked to determine the natural language text that best aligns semantically with the query triple.

Text2Re
The second task can be considered the reverse of Re2Text. Given an instruction (formatted similarly to Re2Text) and a query sentence, the model is required to identify the query triple that best matches the query sentence.
Following McKenzie et al. (2022), both tasks are formulated as multiple-choice question-answering tasks, providing a concrete method for evaluation.

Geirhos et al. (2020) highlighted a phenomenon in deep learning known as shortcut learning: decision rules that achieve high performance on standard benchmarks but fail to generalize under more challenging testing conditions such as real-world scenarios. This issue is particularly significant in language processing tasks, where a language model may show an ability to reason that is learned from the training data, but its performance can drop drastically, sometimes to levels equivalent to random guessing, when superficial correlations are removed from the dataset (Niven and Kao, 2019).
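As a sketch of the multiple-choice formulation described above (our own helper function and wording, not the authors' released code), a Re2Text item could be assembled like this:

```python
def make_re2text_item(triple, candidates, converse=False):
    """Assemble a Re2Text multiple-choice question: given a relation
    definition (normal or converse) and a query triple, the model must
    pick the natural language text that matches the triple semantically."""
    x, rel, y = triple
    # Under the converse definition, (x, R, y) reads as "y R x".
    reading = f"y {rel} x" if converse else f"x {rel} y"
    return (
        f"Instruction: (x, {rel}, y) means that {reading}.\n"
        f"Question: ({x}, {rel}, {y})\n"
        f"A: {candidates[0]}\n"
        f"B: {candidates[1]}\n"
        "Which choice matches the query triple? Answer:"
    )

item = make_re2text_item(
    ("sword", "has part", "hilt"),
    ["sword has part hilt", "hilt has part sword"],
    converse=True,
)
# Under the converse definition, the expected answer here is B.
```

Text2Re is the mirror image: the question is a sentence and the candidates are triples.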

Text Variants

To assess how extensively current LLMs leverage shortcut learning in the evaluation tasks we have designed, we introduce variants to the text in both tasks. Concretely, we alter the natural language text on both the test side and the few-shot example side to obtain paraphrased variants.
Test Variants In the Re2Text task, we paraphrase one answer candidate, while in the Text2Re task, we paraphrase the question. Specifically, we modify the key predicate and restructure the sentence. We note that the subtle variations on the test text could bring different effects to the two tasks, which will be evidenced by the empirical results in our experiments (see Section 4.2). Examples of the test variants, as well as intuitive explanations of their effects on the two tasks, are provided in Figures 2 and 3. Detailed zero-shot prompting methods can be found in Table 2.*

Few-shot Example Variants Besides the variants on the test text, we further introduce variants to the text within the examples used for few-shot prompting. Since we have identified the most challenging variant settings within the zero-shot tasks, we employ the same configurations for the test text in the few-shot context, denoting these as hard tests.
* Relation settings and hint will be thoroughly discussed in Section 3.2.
Accordingly, we integrate text variants within the examples for few-shot prompting. A comprehensive list of the few-shot prompts utilized in our benchmark can be found in Table 3, and the specific arrangements of text variants are illustrated in Table 4. Notably, if the hard test setting aligns with the unaltered test text (as for the Text2Re task), then the unaltered examples are labeled as hard, while the altered examples are labeled as regular. This setup shares a similar spirit with complexity-based prompting (Fu et al., 2022), where hard examples serve to refine problem understanding and mitigate model bias.

Data Collection

To make our tasks more comprehensive, and thus test the LLMs' ability to reason in more complex ways, plausible relations must satisfy two requirements:
• The relation is asymmetric, implying that R ≠ R⊤. An example of such a relation is parent of. Here, the order of the involved entities significantly changes the meaning, as the parent-child relationship is not mutual. Conversely, if the relation is symmetric, such as neighboring country, it would be meaningless to determine whether a given entity should be a head or a tail, as both are semantically equivalent.
• The involved subject and object are interchangeable. That is, the relation R and its converse counterpart R⊤ should be semantically plausible, though not equivalent. An example of a relation we would avoid under this criterion is native language, which associates a person with a language. A language cannot logically be the subject of native language, thereby disqualifying this relation.
Relations of this sort could allow LLMs to rely on shortcut learning to solve tasks. For instance, in the case of native language, the entity's type inadvertently reveals the answer, so the LLMs may exploit this leaked information.

Table 2: Zero-shot prompts. *: each prompt method has been associated with a unique ID that will be referred to in the experimental results.
3 Experiment Setup

Model and Metric
We evaluated three LLM families on our ConvRe benchmark: OpenAI GPT-3 (Brown et al., 2020), Anthropic Claude (Anthropic, 2023), and Google Flan-T5 (Chung et al., 2022) (model details in Appendix B). Since we do not have enough credits for the OpenAI APIs, we evaluate OpenAI GPT-4 on a subset of our benchmark for few-shot experiments.† We use classification accuracy as our main metric for both the Re2Text and Text2Re tasks.

† The subset is constructed by randomly sampling 20 triples for each relation from the full set. When the number of triples for a particular relation is less than 20, we include all of them. Ultimately, the subset comprises a total of 328 triples. We run GPT-4 on both the full set and the subset in zero-shot settings; results show that the subset reflects the model's performance. Details can be found in Appendix C.

Figure 4 (zero-shot prompt with hint):
Read the instruction and then answer the question using A or B. Note that in this task, if the relation is defined in a converse manner, unlike the conventional definition, you should carefully choose the answer.
Instruction: (x, has part, y) indicates that x has a part called y.
Question: (?, has part, solingen)
A: Find an entity that solingen contains.
B: Find an entity that has a part called solingen.
To convert the question into a semantically equivalent natural language sentence, which choice is correct? Look out for the ORDER of the entities in the instruction!
Answer:
Expected Answer: B
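The subset construction described in the footnote above (up to 20 triples per relation, keeping all triples for smaller relations) can be sketched as follows; the data structure and function name are our own assumptions:

```python
import random

def build_subset(triples_by_relation, per_relation=20, seed=0):
    """Randomly sample up to `per_relation` triples for each relation;
    relations with fewer triples contribute all of theirs."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    subset = []
    for relation, triples in triples_by_relation.items():
        if len(triples) <= per_relation:
            subset.extend(triples)
        else:
            subset.extend(rng.sample(triples, per_relation))
    return subset

# e.g. a relation with 30 triples contributes 20; one with 8 contributes all 8.
```

Applied to 17 relations with this cap, such a procedure yields the 328-triple subset reported in the paper.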

Prompting Methods
As depicted in Zhang et al. (2023), different prompting methods can have a considerable impact on the scaling trends of language models. To account for this in our study, we utilize diverse prompting methods. Generally, we have zero-shot and few-shot prompting, each tailored with specific design elements. Detailed illustrations are provided in Tables 2, 3, and 4. While we previously discussed these from a motivation point of view, this subsection offers a closer look at the implementation specifics.
Zero-shot We assess both normal and converse relations mainly in the zero-shot setting, where each setting is coupled with regular and altered test text (refer to the text variations in Section 2.4). For the converse relation evaluation, we additionally equip the prompt with a hint (Kojima et al., 2022). An illustration of the hint used in our experiment is shown in Figure 4.

Few-shot In this setting, we only apply the hard settings, as documented in Table 3. The corresponding zero-shot tests (ID 3# for Text2Re and ID 4# for Re2Text, detailed in Table 2) are employed as baselines. The arrangements for the example variants are thoroughly detailed in Table 4. Within each group, we have three distinct sub-settings: 3-shot, 3-shot with hint & chain-of-thought (CoT; Wei et al., 2022b), and 6-shot.

Results
In this section, we present the results of different LLM families on the ConvRe benchmark and provide an in-depth analysis. More results on chat models can be found in Appendix D.

Converse Relation
Our first experiment, conducted in the zero-shot setting, involves both normal and converse relations across all model families. As shown in Figure 5, the performance on converse relations, within the scope of unaltered test text, is consistently inferior to that on normal relations across all tested models and tasks. More specifically, we note a roughly positive scaling trend for normal relations and an inverse scaling trend for converse relations, despite some outliers. The state-of-the-art LLM, GPT-4, underperforms compared to smaller models, with its performance falling significantly below random-guess levels. We conjecture that larger models have stronger priors, causing them to rely more heavily on memorized patterns from training data, which can conflict with the given task.

Text Variants
As introduced in Section 2.4, we are interested in LLMs' behaviour under text variants on both the test text and the few-shot examples.
Our initial focus is the zero-shot setting (Figure 5). For normal relations, test variants cause a noticeable performance drop. This means that if a given answer candidate fits the superficial pattern stated in the instruction, models are more likely to select it even though it could be incorrect. This suggests that LLMs tend to take shortcuts even within conventional problem settings. For converse relations, variants on the test text harm the performance on Re2Text while enhancing it on Text2Re. These findings lend strong support to our hypothesis presented in Section 2.4.
In the few-shot setting, the zero-shot baselines for both tasks are set to be hard (see Tables 3 and 4). Generally, hard examples outperform standard examples (hard-hard vs. standard-hard) on average across different models on the two tasks. This can be attributed to the fact that hard examples align more consistently with the hard tests and effectively help models avoid bias and shortcut learning.

Figure 6: Few-shot results on ConvRe. Each experimental setting has been indexed with a unique ID that can be referred to in Table 3. Sub-figures in the same row share the same figure legend, so we only display it once in the leftmost sub-figure to save space. Detailed settings on the text variants can be found in Table 4. For GPT-4, we only test it on a subset of our benchmark. Due to Flan-T5's weak ability to follow CoT instructions, we do not report the results of Flan-T5 with hint and CoT prompting.

Shot Number
Examples, particularly an increased number of examples, are expected to outperform zero-shot prompting. However, we do not consistently observe improvements across different models and tasks. Notably, GPT models demonstrate the most consistent improvements, indicating superior in-context learning abilities among these models. Interestingly, when using few-shot examples, the models mostly exhibit inverse scaling or inverted U-shaped scaling, which suggests that our benchmark presents a challenge for current LLMs.

Hint and CoT
The zero-shot experiments in Figure 5 indicate that the use of hints in prompts typically yields improvements for GPT and Flan-T5 models. However, claude-1 stands out as an exception, appearing to be negatively affected by the hint.
In the few-shot experiments, employing hints and the chain-of-thought (CoT) approach substantially boosts performance, particularly for larger models. GPT models exhibit positive scaling and U-shaped scaling on the Re2Text task. However, for the Text2Re task, we still observe inverted U-shaped scaling for GPT models and inverse scaling for Claude models. This indicates that LLMs still struggle on this task even with strong prompting methods. We also find that Flan-T5 cannot properly follow CoT instructions, so we do not report results for Flan-T5 with hint and CoT prompting.

Related Work
Studies on LLMs have shown positive scaling trends, whereby larger models generally perform better on downstream tasks (Brown et al., 2020; Rae et al., 2021; Chowdhery et al., 2022; Srivastava et al., 2022; Liang et al., 2022). However, researchers have shown that model performance scaling can deviate from naive expectations. Srivastava et al. (2022) showed slower and less smooth trends, and that social biases sometimes scale inversely with model size, a finding that is echoed in Parrish et al. (2022). TruthfulQA (Lin et al., 2022) demonstrated that while larger models can provide more informative answers, they tend to be less truthful. McKenzie et al. (2022) introduced the inverse scaling challenge and collected tasks that are highly atypical but still easily understandable by a human. Wei et al. (2022a) uncovered the U-shaped scaling trend by expanding the model scope for evaluation. Zhang et al. (2023) proposed NeQA and showed that this task exhibits inverse, U-shaped, or positive scaling under different prompting methods or model families. Miceli-Barone et al. (2023) showed that LLMs fail to correctly generate Python code when default identifiers are swapped.
Recent research has highlighted the issue of inflated performance in LLMs. Geirhos et al. (2020) coined the term shortcut learning, revealing models' reliance on superficial cues. Tu et al. (2020) studied models' robustness to spurious correlations, i.e., prediction rules that work for the majority of examples but do not hold in general. Li et al. (2023b) found that LLMs tend to rely on shallow matching rather than understanding mathematical concepts. Bender et al. (2021) highlighted the importance of understanding the mechanisms by which LLMs achieve state-of-the-art performance. Perez et al. (2021) showed that LLMs' few-shot ability is often overestimated due to the use of large held-out sets. Ji et al. (2023) surveyed the hallucination problem in language generation, highlighting the issue of factually incorrect output. Liu et al. (2023) identified attention glitches in Transformers, indicating a failure to capture robust reasoning.

Conclusion
In this paper, we present an investigation into LLMs' understanding of structured semantics, specifically focusing on converse binary relations. By introducing a novel benchmark, ConvRe, we offer a systematic and comprehensive evaluation suite to observe the performance of LLMs across diverse settings and prompting methods. We have carried out a detailed experimental study and observed various scaling trends that shed light on the capabilities and limitations of LLMs. Our findings suggest that LLMs often resort to shortcut learning and still face considerable challenges on our proposed benchmark, even when strong prompting techniques are employed. Our work underscores the importance of developing evaluation methodologies to improve the understanding of LLMs and their performance across various tasks.

Limitations
This paper proposes a new benchmark, ConvRe, to evaluate the competence of LLMs in recognizing and processing converse relations. Due to budget limitations, we have evaluated three representative LLM families on our benchmark. We note that the LLM APIs may change over time. Although we have set the sampling temperature to 0, we cannot fully guarantee the reproducibility of our results. Another potential limitation is the prompting methods used in this work. To automatically evaluate model performance, we have followed previous studies and formatted the tasks as multiple-choice question answering tests. This setting may affect the performance of smaller models.

Ethics Statement
Our work proposes a new benchmark to help reveal the real capability of LLMs in formal language oriented tasks. The triples in our benchmark are all extracted from publicly available and widely used knowledge graph datasets. We show that LLMs have taken shortcuts in these tasks and that their performance could be inflated. These findings may help users better understand LLMs and avoid potential risks.

B Model Family Details
B.1 OpenAI GPT

The models we use in our experiments are mainly GPT-3 models (text-ada-001, text-babbage-001 and text-curie-001), GPT-3.5 models (text-davinci-003 and gpt-3.5-turbo) and GPT-4. GPT-3 models can understand and generate natural language. These models were superseded by the more powerful GPT-3.5 generation models. Among the GPT-3.5 models, gpt-3.5-turbo has been optimized for chat but also works well for traditional completion tasks. GPT-4 is a large multimodal model that can solve difficult problems with greater accuracy than any other model in the OpenAI GPT family.

B.2 Anthropic Claude
Claude is capable of a wide variety of conversational and text processing tasks; it can help with use cases including summarization, search, and creative and collaborative writing. Claude comes in two different sizes: claude-1 and claude-instant-1. claude-1 is the largest model in the Claude family and is ideal for a wide range of complex tasks. claude-instant-1 is a smaller model with far lower latency. Both models are provided in many different sub-versions; among them, claude-1.3 and claude-instant-1.1 are used for our experiments.

B.3 Google Flan-T5
Flan-T5 is an enhanced version of T5 that has been finetuned on a mixture of tasks. Unlike the OpenAI GPT models, Flan-T5 is an encoder-decoder model. There are five models of different sizes in the Flan-T5 family: Flan-T5-Small, Flan-T5-Base, Flan-T5-Large, Flan-T5-XL and Flan-T5-XXL. All five models are used in the experiments.

C Subset Results
To verify that the constructed subset can unbiasedly reflect the performance of the GPT-4 model, we compare the performance of GPT-4 on both the complete benchmark and the subset. The results are shown in Table 6. The performance of the GPT-4 model shows minimal difference between the complete set and the subset, confirming the validity of the subset.

D Chat Model Performance
As chat models usually have a better ability to follow instructions, they may demonstrate a different scaling trend on our benchmark. Therefore, we independently evaluate and compare the two chat model families (i.e., OpenAI GPT and Anthropic Claude) on our benchmark. As GPT-4 is also optimized for chat, we include it in the analysis as well.
The performance of the two families are shown in Figure 7.
In the Re2Text task, it can be observed that few-shot prompting with chain-of-thought can significantly improve the performance of GPT models. The accuracy of GPT-4 demonstrates a remarkable improvement, soaring from below 0.2 in the zero-shot setting to surpassing 0.9 in the few-shot+hint+CoT setting. Chain-of-thought is also helpful in improving the performance of Claude-1.
In the Text2Re task, GPT models exhibit a distinct and consistent inverse scaling trend in both zero-shot and few-shot settings when the relation is conversed. However, the scaling trend of Claude models is more intricate. Specifically, in zero-shot settings, Claude models demonstrate a positive scaling trend in the majority of settings. In few-shot settings, on the contrary, an inverse scaling trend is exhibited by Claude models.

E Model Behaviors
This section describes the behaviors of different models that we observed during experiments. Under zero-shot settings, Claude and Flan-T5 generate answers in the expected format. However, text-ada-001 and text-babbage-001 fail in most cases; they tend to repeat our question or instruction. In our experiments, if these two models do not give a clear answer, we treat the choice with the higher log probability on the first token as the answer.
In few-shot settings, nearly all models except Flan-T5 conform to the expected answer format. The generated thoughts of Flan-T5 are usually shorter than the examples, and the format of its answers seldom aligns with the expected format.
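The fallback described above can be sketched as follows (a simplified illustration; we assume the per-token log probabilities for the first generated token are available from the API, and the parsing heuristic is our own):

```python
import re

def extract_answer(generation, first_token_logprobs):
    """Parse a clear standalone 'A' or 'B' from the generation; otherwise
    fall back to whichever choice token has the higher log probability
    on the first generated token."""
    match = re.search(r"\b([AB])\b", generation.upper())
    if match:
        return match.group(1)
    # Fallback for models that repeat the question instead of answering.
    return max(("A", "B"),
               key=lambda c: first_token_logprobs.get(c, float("-inf")))

assert extract_answer("Answer: B", {}) == "B"
assert extract_answer("I will repeat the instruction.",
                      {"A": -2.3, "B": -0.4}) == "B"
```

This keeps evaluation automatic even when weaker models fail to emit a well-formed choice.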

Figure 1: Illustration of converse relation comprehension by LLMs. This diagram highlights the unique challenges converse relations present for LLMs, potentially leading to diverse scaling trends.

Figure 2: Examples of Re2Text and Text2Re tasks on converse relations. We additionally paraphrase the natural language representations (answer candidates for Re2Text, the question for Text2Re) to make them differ from the sentences in the instruction.

Figure 2: The Re2Text task converts a relation into semantically equivalent natural language text. Given that LLMs mostly encounter normal relations during pre-training, deciphering converse relations poses a significant challenge. LLMs tend to exploit textual similarity shortcuts for prediction, which can mislead the model as it bypasses genuine comprehension. In the regular scenario (top), two shortcuts lead the model towards divergent answers, where the incorrect answer (A) will not be overly preferred. In the hard scenario (bottom), the text for the correct response (B) is modified, transforming two shortcuts into a single one. This solitary shortcut is more likely to misdirect the model towards the incorrect answer (A), highlighting the pitfalls of shortcut learning.

Figure 3: The Text2Re task converts natural language text into a semantically equivalent relation triple. As with the Re2Text task, this process can be misled by shortcut learning. In the regular scenario (top), an altered question is used, resulting in a single shortcut that leads the model towards the incorrect answer (A). In the hard scenario (bottom), the combination of natural language text and the relation definition creates two shortcuts, both leading to the incorrect answer (A), thus increasing the likelihood of misprediction.

Figure 4: An illustration of zero-shot prompting with a hint. Red font indicates the hint.

Figure 5: Zero-shot results on ConvRe. Each experimental setting has been indexed with a unique ID that can be referred to in Table 2. Sub-figures in the same row share the same figure legend, so we only display it once in the leftmost sub-figure to save space.

Figures 8 to 19 demonstrate the 12 kinds of prompts used in the Re2Text tasks.


Table 3: Few-shot prompts. ♠: the hard test setting is always employed (see Table 2). ♣: examples are provided in two options, regular and hard.

Table 4: Text variants on the test and example sides for few-shot prompting.

To meet the second condition for relations in Section 2.5, we merge the relation mother of person from the NELL-ONE dataset with the relation father from Wikidata5M to create a new relation called parent of. In this way, there are 17 relations in total, and the detailed number of triples for each relation is shown in Table 5. The source knowledge graphs these relations come from cover a wide range of domains, such as socio-political and commonsense, which ensures the diversity of our dataset.

Table 6: Comparison of GPT-4 results between the complete set and the subset under zero-shot settings.

Table 5: The details of the relations in our ConvRe benchmark.