Zero-shot Triplet Extraction by Template Infilling

The task of triplet extraction aims to extract pairs of entities and their corresponding relations from unstructured text. Most existing methods train an extraction model on training data involving specific target relations, and are incapable of extracting new relations that were not observed at training time. Generalizing the model to unseen relations typically requires fine-tuning on synthetic training data which is often noisy and unreliable. We show that by reducing triplet extraction to a template infilling task over a pre-trained language model (LM), we can equip the extraction model with zero-shot learning capabilities and eliminate the need for additional training data. We propose a novel framework, ZETT (ZEro-shot Triplet extraction by Template infilling), that aligns the task objective to the pre-training objective of generative transformers to generalize to unseen relations. Experiments on FewRel and Wiki-ZSL datasets demonstrate that ZETT shows consistent and stable performance, outperforming previous state-of-the-art methods, even when using automatically generated templates. https://github.com/megagonlabs/zett/


Introduction
Extracting pairs of entities and their relations from unstructured text is vital to several applications including knowledge base population, text retrieval, and question answering (Lin et al., 2015; Xu et al., 2016). Traditional approaches (Yu and Lam, 2010; Singh et al., 2013) obtain entity pairs and relations step by step by treating entity recognition and relation classification as two separate sub-tasks. However, such multi-step approaches suffer from cascading errors and ignore the interdependence between the tasks. Recent studies aim at obtaining entities and relations together in a single step. Given a set of pre-defined relation types and an input sentence, they extract triplets of the form (head, relation, tail). We refer to this task as triplet extraction (illustrated in Fig. 1).

* The work was done when Bosung Kim was a research intern at Megagon Labs.
If the relation types are pre-defined, an extraction model can be trained on large-scale labeled data acquired via distant supervision or crowdsourcing (Sorokin and Gurevych, 2017; Han et al., 2018). However, such methods are hard to adopt in real-world scenarios where ground-truth entities and relation types cannot be specified in advance. To overcome these limitations, there is increasing interest in generalizing models to extract entities and relations that are not observed during training, i.e., the zero-shot setting.
Automatically generating training data for unseen relations is a widely used approach to endow an extraction model with zero-shot capabilities. Distant supervision (Mintz et al., 2009; Zeng et al., 2015; Ji et al., 2017) and data augmentation (Chia et al., 2022) can provide automatically labeled training data but suffer from the poor quality and inconsistency of the synthetic data. They also require additionally fine-tuning the model on the synthetic data, which can be computationally intensive. Yet another set of approaches (Zhong et al., 2021; Sanh et al., 2022) aims to extract unseen relations by learning cross-task knowledge from a collection of similar datasets and tasks. However, the performance of these methods depends on the similarity between seen and unseen relations.
Recent progress in large language models (LMs) such as GPT-3 has shown that they are capable of zero-shot learning if the task objective is aligned with the LM training objective (Brown et al., 2020). Standard NLP tasks such as classification (Zhong et al., 2021) and relation extraction (Levy et al., 2017; Obamuyide and Vlachos, 2018) have been successfully reformulated as prompt-based tasks. However, prior work on zero-shot triplet extraction (Chia et al., 2022) still misaligns the task objective with the LM training objective. In this work, we formulate triplet extraction as a template infilling task to directly optimize the zero-shot learning objective.
We propose a novel framework, ZETT (ZEro-shot Triplet extraction by Template infilling), based on an end-to-end generative language model, T5 (Raffel et al., 2019), that is pre-trained to predict masked consecutive spans. Figure 2 shows the overview of ZETT. To align the task objective with the pre-training objective, ZETT relies on natural language templates for relations in which the head and tail entities have been masked out. It extends the input sentence with the template and fine-tunes T5 to correctly predict the participating entities. At inference, given templates of unseen relations, it predicts and scores entity pairs, providing a ranked list of triplets. In this manner, it predicts entities and unseen relations in a single step without requiring additional fine-tuning.
Relying on templates for triplet extraction has several advantages. A template encodes implicit information, such as entity types, that can help the model correctly predict the entities as well as their order. For example, for the relation participant in, the template "<head> is a participant in <tail>" gives implicit type constraints: the head entity should be a human or a sports team, and the tail entity should be a game or contest.
Experiments on two datasets show that ZETT achieves state-of-the-art performance even with a much smaller model than its counterparts. We find that the templates in ZETT identify entities more effectively than previous approaches. We also show that ZETT can be integrated with automatically generated templates without significant loss in performance.
Our contributions are as follows: 1) We propose a new framework for zero-shot triplet extraction, called ZETT, that uses templates to align fine-tuning with the pre-training objective. 2) ZETT is more robust in extracting entities, as templates give implicit information about entity types. 3) Experiments on two datasets show that ZETT significantly outperforms prior methods in the single-triplet setting and achieves competitive performance in the multi-triplet setting, gaining up to 6.35 points over the previous best-performing baseline with a smaller model size.

Related Work
Zero- and few-shot Triplet Extraction Triplet extraction refers to a knowledge extraction task whose goal is to extract a triplet of knowledge, (head, relation, tail), from source text. Compared to relation extraction, triplet extraction is more challenging in that models need to predict the entities and the relation at once, and it has been less explored. Recently, Paolini et al. (2021) suggested a supervised method based on the text-to-text translation approach. For zero-shot learning, RelationPrompt (Chia et al., 2022) introduced a sequence-to-sequence model, where the decoder generates a structured template of the triplet, such as "Head Entity: <head>, Tail Entity: <tail>, Relation: <relation>", given the source context. This approach relies on synthetic data generation for unseen relations. However, the quality and consistency of synthetic data are not guaranteed, and fine-tuning is misaligned with pre-training. Triplet extraction can also be combined with relation extraction models. For zero- and few-shot relation extraction, a concept-enhanced relation classification model was introduced that leverages additional information about entities to predict unseen relations. Zhong et al. (2021) suggested a meta-tuning approach, where the model is trained on a collection of similar tasks so that it can learn cross-task knowledge. Wang et al. (2021) suggested DEEPEX, a unified framework for zero-shot relation classification and information extraction based on the text-to-text model.
LM prompt-tuning with templates Our approach shares with LM prompt-tuning the idea of using natural language templates to reduce the gap between pre-training and downstream tasks.

Figure 2: Overview of fine-tuning and inference for triplet extraction in ZETT.
Most approaches reframe the downstream task as a masked language modeling problem. Similar methods have been introduced for text classification (Obamuyide and Vlachos, 2018; Hu et al., 2022; Chen et al., 2022) and named entity recognition (Cui et al., 2021). They propose prompt-based fine-tuning, which transforms a classification problem into a masked LM task using task descriptions. PET (Schick and Schütze, 2021a,b) uses natural language patterns to reformulate input examples as cloze-style phrases. However, these approaches require creating templates, and performance depends on the templates (Jiang et al., 2020; Hu et al., 2022). Techniques to automatically generate templates have been actively studied (Jiang et al., 2020; Shin et al., 2020). In our experiments, we provide results of ZETT with automated templates from Jiang et al. (2020).

Task Definition
The goal of triplet extraction is to extract knowledge in the form of a triplet <head (h), relation (r), tail (t)> from an unstructured sentence. The input to the model is a context x, and the model outputs a triplet <h, r, t>, or a set of triplets when x contains multiple triplets. In the general triplet extraction setting, h and t are text spans of the input context x, and the set of relations is predefined. For zero-shot triplet extraction, we follow the zero-shot setting of Chia et al. (2022): no training examples for unseen relations are available, so the model should predict a new triplet <h, r, t> where the relation r is not seen during training. In the following sections, we denote the set of seen relations as R_s and the set of unseen relations as R_u, where R_s and R_u are disjoint.

ZETT: Zero-shot Triplet Extraction by Template Infilling
In this section, we introduce a new framework for zero-shot triplet extraction, called ZETT. Our strategy for inferring unseen relations is to maximize the benefit of a large PLM by aligning the fine-tuning task with the pre-training of the PLM. We use the pre-trained T5 model (Raffel et al., 2019) and formulate triplet extraction as a text span infilling problem using natural language templates. T5 is pre-trained to predict randomly dropped-out consecutive spans of a sentence. This is useful for retrieving entity spans because the length of entity spans varies, and we do not need to predefine the number of mask tokens to be generated. It also frees us from the burden of using iterative methods (Cui et al., 2021). We apply this to triplet extraction by masking entity tokens in the template, then fine-tuning the model to predict the masked entity tokens. Figure 2 shows the overview of fine-tuning and inference for triplet extraction in ZETT. For the model input, we first manually build a template for every relation as a natural language sentence. Table 1 shows examples of relations and templates. Each template is created based on the description provided by Wikidata (Vrandečić and Krötzsch, 2014).
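The input construction above can be sketched as follows. Here <X>, <Y>, <Z> follow the paper's notation for T5's sentinel tokens, and the function name is illustrative rather than taken from the released code:

```python
def build_infilling_example(context, template, head, tail):
    """Build a (source, target) pair for span-infilling fine-tuning.

    The template's entity placeholders become sentinels in the source,
    and the target emits each entity after its matching sentinel,
    mirroring T5's span-corruption pre-training format.
    """
    source = context + " " + template.replace("<head>", "<X>").replace("<tail>", "<Y>")
    target = "<X> " + head + " <Y> " + tail + " <Z>"
    return source, target

src, tgt = build_infilling_example(
    "Messi played in the final.",
    "<head> is a participant in <tail>",
    "Messi", "the final")
# src: "Messi played in the final. <X> is a participant in <Y>"
# tgt: "<X> Messi <Y> the final <Z>"
```

In an actual run, `src` and `tgt` would be tokenized and fed to T5 with the standard sequence-to-sequence loss.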

Inference with relation constraint
At inference, we evaluate the model on contexts that have unseen relations R_u. Let r_u ∈ R_u and |R_u| = m. We first build m input sequences in the same way as in fine-tuning: concatenating the context x with the template t(r_u) and replacing the placeholders with mask tokens. We then generate entity tokens y = "<X> e_1 <Y> e_2 <Z>" for all m input sequences and score the outputs with the probability function (1). For the final prediction, we choose the triplet <e_1, r_u, e_2> (or <e_2, r_u, e_1>, depending on the template) with r_u = argmax_{r ∈ R_u} P(y | x, t(r)). In the experiments, we use beam search to decode multiple output sequences so that the model can predict multiple triplets.
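A minimal sketch of this scoring loop, with a stand-in `generate_and_score` callback in place of actual T5 beam-search generation and sequence scoring:

```python
def rank_triplets(context, templates, generate_and_score):
    """Score one candidate triplet per unseen relation and return the
    candidates sorted by log-probability, highest first.

    `templates` maps relation name -> template string;
    `generate_and_score(context, template)` returns (head, tail, log_prob).
    """
    candidates = []
    for relation, template in templates.items():
        head, tail, log_prob = generate_and_score(context, template)
        candidates.append((log_prob, (head, relation, tail)))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [triplet for _, triplet in candidates]
```

The top-ranked triplet is the final prediction; keeping the full ranking supports the multi-triplet setting.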
We could compute a score for every relation and then sort them, but this is inefficient and slow when the number of relations is large. Therefore, instead of exhaustive scoring, we exploit relation constraints to filter out relations that are irrelevant to the context. As Chen and Li (2021) have shown that relation descriptions can be useful for classifying the relation of a context in the zero-shot setting, we use the similarity between the context and the relation description to exclude irrelevant relations. We first obtain sentence embeddings of the context and of each relation's description using the off-the-shelf SBERT (Reimers and Gurevych, 2019). We then compute cosine similarities between the context embedding and each relation description embedding. To filter out relations, we adopt nucleus sampling (top-p) (Holtzman et al., 2020). After filtering, we score and choose the triplet with r_u = argmax_{r ∈ R̂_u} P(y | x, t(r)), where R̂_u is the constrained unseen relation set.
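A sketch of the relation constraint, assuming (since the paper does not spell out the normalization) that the similarities are softmax-normalized before applying the top-p cutoff; real SBERT embeddings are replaced by plain vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def constrain_relations(context_emb, description_embs, p=0.85):
    """Keep the smallest set of relations whose softmax-normalized
    context-description similarity mass reaches p (top-p filtering)."""
    sims = {r: cosine(context_emb, e) for r, e in description_embs.items()}
    z = sum(math.exp(s) for s in sims.values())
    ranked = sorted(((math.exp(s) / z, r) for r, s in sims.items()), reverse=True)
    kept, mass = [], 0.0
    for prob, relation in ranked:
        kept.append(relation)
        mass += prob
        if mass >= p:
            break
    return kept
```

Only the surviving relations' templates are then scored with the LM, which keeps inference cost proportional to the size of the nucleus rather than |R_u|.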

Dataset
We evaluate our method on two datasets: FewRel (Han et al., 2018) and Wiki-ZSL (Chen and Li, 2021). FewRel is a dataset for few-shot relation classification; we use the version transformed for zero-shot triplet extraction (Chia et al., 2022). Wiki-ZSL, a subset of Wiki-KB, was built for zero-shot relation extraction. Both datasets are built by distant supervision, but FewRel is additionally filtered by humans. We adopt the zero-shot setup of Chia et al. (2022): 1) the relation sets of training, validation, and test are disjoint; 2) we experiment with three settings, where the number of unseen relations m is 5, 10, or 15; and 3) each setting has five data folds in which relation labels are split with different random seeds {0, 1, 2, 3, 4}. Table 2 shows statistics of each dataset and setting.

Experimental Settings
Training For the base PLM, we use pre-trained T5 models of two sizes: T5-small and T5-base (60M and 220M parameters). We fine-tune the model for 3 epochs with a batch size of 64 and a learning rate of 1e-4 for T5-small and 3e-5 for T5-base. We tune hyperparameters on the development set of m=10 and use the same setting for the experiments with m=5 and m=15.
Inference As described in Section 3.2, ZETT generates entity spans of the form "<X> e_1 <Y> e_2 <Z>" given the input sequence. For entity generation, we use a beam size of 4, so we have a maximum of 4 candidate triplets for each relation. We also constrain the vocabulary so that tokens are generated only from the input context, since entities are text spans of the input context. For the relation constraint, we set the threshold p on the validation set of m=10 and choose p=0.85 based on the best performance. An ablation study of the generation settings is provided in Section 4.6.
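The vocabulary constraint can be sketched as a per-step allowed-token function, in the spirit of the `prefix_allowed_tokens_fn` hook in HuggingFace's `generate` API; the exact mechanism used by ZETT is not specified, so this is only an illustration:

```python
def make_prefix_allowed_fn(context_ids, sentinel_ids):
    """Restrict generation to token ids that appear in the input context,
    plus the sentinel ids, since gold entities are spans of the context."""
    allowed = sorted(set(context_ids) | set(sentinel_ids))

    def prefix_allowed_tokens_fn(batch_id, input_ids):
        # Same allowed set at every decoding step (a simplification:
        # a stricter variant could enforce contiguous context spans).
        return allowed

    return prefix_allowed_tokens_fn

fn = make_prefix_allowed_fn(context_ids=[5, 9, 5], sentinel_ids=[32099, 32098])
```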
Single- and multi-triplet evaluation In the datasets, each example includes one or more triplets. We evaluate the single- and multi-triplet settings separately. In the single-triplet setting, each example has exactly one gold triplet, so we use accuracy as the metric. In the multi-triplet setting, the number of gold triplets is two or more, so we evaluate performance with an F1 score. To retrieve positives in the multi-triplet setting, we set a threshold on the score (eq. (1)) and regard a triplet as positive if its score exceeds the threshold. The threshold is set on the validation set of m=10.
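A sketch of the thresholded multi-triplet evaluation; exact-match comparison of whole triplets is an assumption about the matching criterion:

```python
def multi_triplet_f1(predictions, gold, threshold):
    """Micro precision/recall/F1 for one example.

    `predictions` is a list of (triplet, score); triplets scoring at or
    above the threshold are kept and compared to the gold set by exact match.
    """
    kept = {t for t, s in predictions if s >= threshold}
    gold_set = set(gold)
    tp = len(kept & gold_set)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```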
Baseline methods We compare the performance of ZETT with three existing methods for triplet extraction: 1) TableSequence (Wang and Lu, 2020) is a joint learning model with two separate encoders performing relation extraction and named entity recognition at the same time. Since TableSequence is designed for supervised learning, we report the results of models trained on synthetic data from Chia et al. (2022). 2) Seq2Seq (Chia et al., 2022) is an encoder-decoder model based on pre-trained BART (Lewis et al., 2020). The encoder takes a context as input, and the decoder generates a triplet as a structured template sequence. 3) RelationPrompt (Chia et al., 2022) is the Seq2Seq model additionally fine-tuned on synthetic data for the unseen relations; it requires an extra model for data generation, such as GPT-2.

Results

Table 3 shows the results of the single- and multi-triplet settings on FewRel and Wiki-ZSL. In the single-triplet setting, ZETT outperforms existing methods in all settings with both T5-small and T5-base. Comparing model sizes, even ZETT with T5-small, which has less than 25% of the parameters of RelationPrompt, outperforms all previous models. With a larger model, ZETT with T5-base obtains significant gains of up to 6.35 and 5.79 points on FewRel and Wiki-ZSL, respectively, still with a smaller model. In the multi-triplet setting, ZETT shows the best F1 scores in all settings except on Wiki-ZSL with m=5. However, we observe that the deviations of RelationPrompt's results are relatively large since the generated synthetic data differ in every experiment. On the other hand, ZETT shows stable performance across all trials. Overall, the results show that ZETT is more effective with a smaller model size and more stable with a simpler training process.

ZETT with Automated Templates
Templates provide the benefit of leveraging PLMs in the form of a natural language sentence. However, hand-crafting templates requires human labor and does not scale when the number of relations is large. Several studies have introduced methods to automatically generate templates. In this section, we investigate how ZETT works with three kinds of automated templates: mining-based (Jiang et al., 2020), paraphrasing-based (Jiang et al., 2020), and automatic prompt generation (Gao et al., 2021).

Methods for automated templates
Mining-based Mining-based generation relies on distant supervision over Wikipedia articles. After collecting sentences mentioning both head and tail entities, it applies the middle-word rule, which regards the words between the two entities as an indication of the relation, and dependency parsing, which obtains the template by mining the phrase span between the entity nodes in the parse tree.

Automating prompt generation Gao et al. (2021) proposed a template generation method that predicts the masked span <X> between head and tail entities. The key idea is to generate templates from a few labeled examples and search for the most generalizable templates. Given a set of labeled examples D_l, template generation proceeds as follows: we generate multiple templates for each example with a large beam size (e.g., 20), score all templates with the LM probability (eq. (1)) using the T5-base model on D_l, and pick the top-k templates with the highest scores. This method is originally designed for few-shot learning, which assumes n labeled examples per class. Following Gao et al. (2021), we use n=32 labeled examples for each relation in our experiments. Note that these examples are not used for fine-tuning the model, but only for generating templates automatically.
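The scoring-and-selection step can be sketched as follows, with `score_fn` standing in for the LM log-probability of eq. (1) evaluated on a labeled example:

```python
def select_templates(candidates, examples, score_fn, k=2):
    """Keep the k candidate templates that best explain the labeled examples.

    Each template is scored by summing score_fn(context, template, head, tail)
    over the examples, i.e., the LM log-probability of the gold entities.
    """
    scored = sorted(
        candidates,
        key=lambda t: sum(score_fn(x, t, h, tl) for x, h, tl in examples),
        reverse=True)
    return scored[:k]
```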

Experimental settings
For the experiments with mining- and paraphrasing-based templates, we use the templates provided by LPAQA (Jiang et al., 2020). Since templates are not provided for all relations in FewRel and Wiki-ZSL, we use a subset of Wiki-ZSL whose relation set overlaps with LPAQA. The resulting set contains 36 relations, which we split 26/5/5 into train/dev/test sets. As in the main experiments, we use five data folds with different random seeds and report the average over the five folds. Statistics for each fold are provided in the Appendix (Table 8). Since the automated methods provide multiple templates per relation, we experimented with k ∈ {1, 2, 3, 4, 5} templates and found k=2 to perform best. Table 4 shows the results of ZETT with automated templates. With the ZETT T5-small model, we obtain an accuracy of 13.34% with the mining-based templates in the single-triplet setting. This is less than a 1-point difference from the manual templates' result, and we find that this gap can be closed with the bigger T5-base model.

Table 5: Accuracy of entity extraction and recognition. In entity extraction, we evaluate whether the model exactly predicts head and tail entities, whereas head and tail are not distinguished in entity recognition.

Results with Automated Templates
In the multi-triplet setting, both the mining- and paraphrasing-based templates perform better than the manual templates, particularly improving recall. Although automatically generated templates are noisier and less accurate, they provide more diverse templates. We observe that ZETT can be integrated with automated templates, and that multiple templates can help the model generalize better to unseen relations.

ZETT for Entity Extraction
One of the challenges in triplet extraction is accurately detecting entity spans in the input sentence. Since most relations are not symmetric, i.e., (h, r, t) ≠ (t, r, h), identifying which entity should be the head or the tail is also critical. Thus, we evaluate the effectiveness of ZETT on the entity extraction task, which aims to predict head and tail entities given an input sentence and a relation. ZETT can be used for entity extraction without changing the model or input format by providing the gold relation's template as input. We also report results on entity recognition to compare the models' ability to distinguish head and tail. In entity recognition, we measure whether the model correctly extracts the entities regardless of head or tail. Experimental settings are the same as in zero-shot triplet extraction. For the metric, we use exact-match accuracy between each gold entity and the model's prediction.
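The two metrics differ only in whether entity order matters; a minimal sketch:

```python
def entity_scores(pred_head, pred_tail, gold_head, gold_tail):
    """Per-example correctness for the two tasks.

    Extraction requires an exact head/tail match in order; recognition
    also accepts the same pair with head and tail swapped.
    """
    extraction = pred_head == gold_head and pred_tail == gold_tail
    recognition = extraction or (pred_head == gold_tail and pred_tail == gold_head)
    return extraction, recognition
```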
Results Table 5 shows the results of entity extraction and recognition. Since RelationPrompt relies on synthetic data, which are noisy and not always correctly labeled, its performance drops when the model is fine-tuned on synthetic data. Another issue with RelationPrompt is the formatted text used for triplet generation, such as "Head Entity: <head>, Tail Entity: <tail>, Relation: <relation>". This format makes it easy to confuse head and tail entities, since it carries no information about whether the target entity is the subject or the object. We also find that formatted text is brittle to parse, since models sometimes fail to copy the exact format, including spaces, commas, and colons. In contrast, ZETT shows consistent improvements over the previous models. Comparing the two tasks, the performance gaps are larger for entity extraction: up to 11.9 points for extraction versus 6.0 for recognition. As discussed in the introduction, this implies that ZETT benefits from natural language templates, which provide implicit constraints on entity types.

Analysis
Ablation Study We conduct an ablation study to examine the importance of each generation setting and the relation constraint, removing each component in turn. The results show that having multiple candidates from beam search gives more chances to find the gold entities, lifting performance by 1.55 points on Wiki-ZSL. When the relation constraint is not applied, i.e., we compute a score for all relations, performance drops by 3.5 points on FewRel. Even though the proposed relation constraint is simple, the results show that filtering irrelevant relations is effective. This implies there is room for further improvement with better relation classification models (Chen and Li, 2021) for the relation constraint.
Error Analysis We find two major factors that cause our model to fail. First, when the test relation set contains similar relations, the model struggles to distinguish them. For example, relations such as country, country of citizenship, and country of origin all represent the concept of a country; in this case, the model has difficulty capturing the subtle conceptual differences between similar relations. Second, since our model uses the PLM's conditional probability, it tends to assign high probability to relations that frequently occur in common sentences. We observe that relations such as location, owned by, or part of generally receive higher scores than more specific relations such as place served by transport hub or located in the administrative territorial entity.

Human Evaluation
The datasets used in our experiments are mainly built by distant supervision (Sorokin and Gurevych, 2017), whose main strategy is to automatically collect sentences mentioning the head and tail entities of a given factual triplet. However, we observe many examples where the context is not related to the given relation or the triplet cannot be inferred from the context. For example, the context "This lizard lives in the southwestern part of Africa, in Namibia and South Africa." is annotated with the triplet (South Africa, shares border with, Namibia). Although this triplet is true, it cannot be inferred from the context. Another type of error is a false negative: some of ZETT's predictions are counted as false although they are actually true. For example, for the context "Google released Android 7.1.1 Nougat for the Pixel C in December 2016.", the labeled answer is (Pixel C, operating system, Android), but our model predicts (Pixel C, owned by, Google), which is also true. Based on these observations, we conducted a human evaluation to obtain more precise results. We manually annotated 1,000 examples: the top-5 results for 200 contexts, randomly sampled from the single-triplet test set of m=10 on Wiki-ZSL. Annotators were asked to assess the top-5 predictions of ZETT with T5-base. Three PhDs knowledgeable in this domain participated in the evaluation, and the annotations show substantial agreement with a kappa coefficient of 0.75. We regard a triplet as true when two annotators agree that it is true. As a result, we found 127 wrongly labeled triplets and discovered 50 more true triplets that were labeled as false in the original dataset. When we re-evaluate ZETT with T5-base on the manually annotated set, accuracy increases from 18.0 to 30.2. The human-annotated test set will be released.
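For reference, a minimal implementation of the agreement statistic for one annotator pair; the paper does not specify which kappa variant was used, so Cohen's kappa is shown here as the standard pairwise choice:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement between two annotators,
    corrected for the agreement expected by chance from each
    annotator's label marginals."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                     for c in set(labels_a) | set(labels_b))
    return (p_observed - p_expected) / (1 - p_expected)
```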

Conclusion
We introduced ZETT, a new framework for zero-shot triplet extraction that needs neither data augmentation nor pipeline systems. We reformulated triplet extraction as a template infilling problem using natural language templates, which enables the model to better leverage PLMs by aligning the pre-training, fine-tuning, and inference objectives. ZETT is simple but powerful in leveraging the knowledge in PLMs, and more robust in entity recognition. Through experiments on two datasets, we demonstrated that ZETT outperforms previous methods with less than 25% of their parameters, and that neither synthetic data nor a pipeline method is necessary for state-of-the-art performance. We also demonstrated that ZETT can be integrated with automated templates, and that it is more robust in extracting entities.