Silver Syntax Pre-training for Cross-Domain Relation Extraction

Relation Extraction (RE) remains a challenging task, especially when considering realistic out-of-domain evaluations. One of the main reasons for this is the limited training size of current RE datasets: obtaining high-quality (manually annotated) data is extremely expensive and cannot realistically be repeated for each new domain. An intermediate training step on data from related tasks has been shown to be beneficial across many NLP tasks. However, this setup still requires supplementary annotated data, which is often not available. In this paper, we investigate intermediate pre-training specifically for RE. We exploit the affinity between syntactic structure and semantic RE, and identify the syntactic relations which are closely related to RE by being on the shortest dependency path between two entities. We then take advantage of the high accuracy of current syntactic parsers in order to automatically obtain large amounts of low-cost pre-training data. By pre-training our RE model on the relevant syntactic relations, we are able to outperform the baseline in five out of six cross-domain setups, without any additional annotated data.


Introduction
Relation Extraction (RE) is the task of extracting structured knowledge, often in the form of triplets, from unstructured text. Despite the increasing attention this task has received in recent years, the performance obtained so far is very low (Popovic and Färber, 2022). This happens in particular when considering realistic scenarios which include out-of-domain setups, and which deal with the whole task, in contrast to the simplified Relation Classification setting which assumes that the correct entity pairs are given (Han et al., 2018; Baldini Soares et al., 2019; Gao et al., 2019). One main challenge of RE and other related Information Extraction tasks is the "domain-specificity": depending on the text domain, the type of information to extract changes. For example, while in the news domain we can find entities like person and city, and relations like city of birth (Zhang et al., 2017), in scientific texts we can find information about metrics, tasks and comparisons between computational models (Luan et al., 2018). While high-quality, domain-specific data for fine-tuning RE models would be ideal, as for many other NLP tasks, annotating data is expensive and time-consuming. A recent approach that leads to improved performance on a variety of NLP tasks is intermediate task training. It consists of a step of training on one or more NLP tasks between the general language model pre-training and the specific end-task fine-tuning (STILT, Supplementary Training on Intermediate Labeled-data Tasks; Phang et al., 2018). However, STILT assumes the availability of additional high-quality training data, annotated for a related task.
In this paper, we explore intermediate pre-training specifically for cross-domain RE and look for alternatives which avoid the need for external manually annotated datasets to pre-train the model on. In particular, we analyze the affinity between syntactic structure and semantic relations by considering the shortest dependency path between two entities (Bunescu and Mooney, 2005; Fundel et al., 2006; Björne et al., 2009; Liu et al., 2015). We replace the traditional intermediate pre-training step on additional annotated data with a syntax pre-training step on silver data. We exploit the high accuracy of current syntactic parsers to obtain large amounts of low-cost pre-training data. The use of syntax has a long tradition in RE (Zhang et al., 2006; Qian et al., 2008; Nguyen et al., 2009; Peng et al., 2015). Recently, work has started to infuse syntax during language model pre-training (Sachan et al., 2021), showing benefits for RE as well. Syntactic parsing is a structured prediction task aiming to extract the syntactic structure of text, most commonly in the form of a tree. RE is also a structured prediction task, but with the aim of extracting the semantics expressed in a text in the form of triplets: entity A, entity B, and the semantic relation between them. We exploit the affinity of these two structures by considering the shortest dependency path between two (semantic) entities (see Figure 1). The idea we follow in this work is to pre-train an RE baseline model over the syntactic relations, i.e. Universal Dependency (UD) labels, which most frequently appear on the shortest dependency paths between two entities (black bold arrows in Figure 2). We assume these labels to be the most relevant with respect to the final target task of RE. In order to feed the individual UD relations into the RE baseline (model details in Section 3.1), we treat them similarly to the semantic connections. With respect to Figure 2, we can formalize the semantic relations as the following triplets:
• NAMED(LFP, Linear-fractional programming)
• TYPE-OF(linear programming, Linear-fractional programming)
• NAMED(LP, linear programming)
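To make the shared representation concrete, the following is a minimal sketch (assuming a simple Python data structure, not taken from any released code) of how semantic relations and silver UD relations can be expressed in the same triplet format; the class and field names are illustrative, and the syntactic triplets follow one plausible UD analysis of the example sentence.

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    arg1: str   # first argument (entity, or dependent token span)
    arg2: str   # second argument (entity, or governor token span)
    label: str  # semantic relation type, or UD label during pre-training

# Semantic relations of the running example (Figure 2).
semantic_instances = [
    Triplet("LFP", "Linear-fractional programming", "named"),
    Triplet("linear programming", "Linear-fractional programming", "type-of"),
    Triplet("LP", "linear programming", "named"),
]

# Silver syntactic relations from the same sentence share the exact same
# format, so the RE model can consume them unchanged (labels shown follow
# one plausible UD parse of the sentence).
syntactic_instances = [
    Triplet("Linear-fractional programming", "generalization", "nsubj"),
    Triplet("linear programming", "generalization", "nmod"),
    Triplet("LFP", "Linear-fractional programming", "appos"),
]
```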
In the next section we describe the detailed training process.

Setup
Data In order to evaluate the robustness of our method over out-of-domain distributions, we experiment with CrossRE (Bassignana and Plank, 2022), a recently published multi-domain dataset. CrossRE includes 17 relation types spanning six diverse text domains: news, politics, natural science, music, literature and artificial intelligence (AI). The dataset was annotated on top of a Named Entity Recognition dataset, CrossNER (Liu et al., 2021), which comes with unlabeled domain-related corpora (released with an MIT License). We use the latter for the syntax pre-training phase.

UD Label Selection
In order to select the UD labels which most frequently appear on the shortest dependency path between two semantic entities, we parse the training portions of CrossRE. Our analysis combines RE annotations and syntactically parsed data. We observe that the syntactic distance between two entities is often higher than one (see Figure 4), meaning that the shortest dependency path between two entities includes multiple dependencies: in the examples in Figure 1, the one above has distance one, the one below has distance two. However, the shortest dependency paths contain a high frequency of just a few UD labels (see Figure 3), which we use for syntax pre-training: nsubj, obj, obl, nmod, appos. See Appendix A for additional data analysis. To perform the syntax pre-training, (1) we sample an equal amount of sentences from each domain (details in Section 4), and (2) use the MaChAmp toolkit (van der Goot et al., 2021) for inferring the syntactic tree of each of them. We apply an additional sub-step for disentangling the conj dependency, as illustrated in Appendix C. Then, (3) we filter in only the nsubj, obj, obl, nmod, and appos UD labels and (4) feed those connections to the RE model (as explained in the previous section). Within the RE model architecture described above, each triplet corresponds to one instance. In this phase, in order to ensure more variety, we randomly select a maximum of five triplets from each pre-training sentence.
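As a concrete illustration of steps (2) to (4), the sketch below is a simplified stand-in for the MaChAmp-based pipeline (function and variable names are assumptions): it extracts the shortest dependency path between two entity head tokens and keeps only the selected UD labels. The cap of five triplets per sentence is applied here as a simple truncation rather than the random selection described above.

```python
import networkx as nx

KEPT_LABELS = {"nsubj", "obj", "obl", "nmod", "appos"}

def silver_triplets(tokens, heads, deprels, ent1_idx, ent2_idx, max_triplets=5):
    """Return (dependent, governor, UD-label) triplets lying on the shortest
    dependency path between two entity head tokens.

    tokens  : list of word forms
    heads   : list of 0-based head indices (-1 for the root)
    deprels : list of UD labels, aligned with tokens
    """
    graph = nx.Graph()
    for child, head in enumerate(heads):
        if head >= 0:
            # store which endpoint is the dependent so direction is recoverable
            graph.add_edge(child, head, label=deprels[child], dependent=child)

    try:
        path = nx.shortest_path(graph, ent1_idx, ent2_idx)
    except nx.NetworkXNoPath:
        return []

    triplets = []
    for a, b in zip(path, path[1:]):
        edge = graph.edges[a, b]
        dep = edge["dependent"]
        gov = heads[dep]
        if edge["label"] in KEPT_LABELS:
            triplets.append((tokens[dep], tokens[gov], edge["label"]))
    return triplets[:max_triplets]
```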

Model
In the second training phase, the fine-tuning one, we replace the classification head (i.e. the feed-forward layer) with a new one, and individually train six copies of the model over the six train sets of CrossRE. Note that the encoder is fine-tuned in both training phases. Finally, we test each model on in- and out-of-domain setups.
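A minimal sketch of this head replacement follows, assuming a PyTorch-style model object with an `encoder` and a linear `classifier` attribute (names are illustrative, not from any released code): the head trained on UD labels is discarded and re-initialised for the CrossRE relation set, while the encoder weights from syntax pre-training carry over and keep being updated.

```python
import torch.nn as nn

def start_finetuning(model, num_relation_types):
    """Swap the UD-label classification head for a fresh RE head.

    The input to the head is the concatenation of the two start-marker
    representations, hence 2 * hidden_size; the encoder is left untouched
    and continues to be fine-tuned in the second phase.
    """
    hidden_size = model.encoder.config.hidden_size
    model.classifier = nn.Linear(2 * hidden_size, num_relation_types)
    return model
```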

Results
Table 1 reports the results of our cross-domain experiments in terms of Macro-F1. We compare our proposed approach, which adopts syntax pre-training, with the zero-shot baseline model. Five out of six models outperform the average of the baseline evaluation, including in- and out-of-domain assessments. The average improvement, obtained without any additional annotated RE data, is 0.71, which is considerable given the low score range of the challenging dataset (with limited train sets, see dataset size in Appendix D) and the cross-domain setup. The model fine-tuned on the news domain is the only one not outperforming the baseline. However, the performance scores on this domain are already extremely low for the baseline, because news comes from a different data source with respect to the other domains, has a considerably smaller train set, and presents a sparse relation type distribution, making it a bad candidate for transferring to other domains (Bassignana and Plank, 2022).
As a comparison, we report the scores obtained with the traditional intermediate pre-training which includes additional annotated data. We pre-train the language encoder on SciERC (Luan et al., 2018), a manually annotated dataset for RE. SciERC contains seven relation types, of which three overlap with the CrossRE relation set. In this setup, the improvement over the baseline includes the news domain, but not the literature domain. Nevertheless, while the gain is on average slightly higher with respect to the proposed syntax pre-training approach, it comes at a much higher annotation cost.

Pre-training Data Quantity Analysis
We inspect the optimal quantity of syntactic data to pre-train our RE model on by tuning this hyperparameter over the dev sets of CrossRE. The plot in Figure 5 reports the average performance of the six models when pre-trained on increasing amounts of syntactic dependencies (pre-training performance is reported in Appendix E). Starting from 8.4K instances onward, the performance stabilizes above the baseline. We select the peak (20.4K, albeit results are similar between 18K and 20.4K) for reporting our test set results in Table 1. While we are interested in the robustness of our method across multiple domains, and therefore consider the average (Figure 5), domain optima could be achieved by examining individual domain performance. As an example, we report in Figure 6 the plot relative to the model fine-tuned on AI, which is the one obtaining the highest gain. The model fine-tuned on AI generally gains a lot from the syntax pre-training step, with its peak at 15.6K pre-training instances.

Conclusion
We introduce syntax pre-training for RE as an alternative to the traditional intermediate training which uses additional manually annotated data. We pre-train our RE model over silver UD labels which most frequently connect the semantic entities via the shortest dependency path. We test the proposed method over CrossRE and outperform the baseline in five out of six cross-domain setups. Pre-training over a manually annotated dataset, in comparison, only slightly increases our scores in five out of six evaluations, but at a much higher cost.

Limitations
While we already manage to outperform the baseline, the pre-training data quantity is relatively small (∼20K instances). Given the computational cost of training 30 models (six train sets, over five random seeds each) and testing them within in- and cross-domain setups, we stop the inspection of the optimal pre-training data amount at 24K instances. However, we do not exclude that more pre-training data would be even more beneficial for improving over the baseline.
Related to computational cost constraints, we test our syntax pre-training approach over one set of UD labels only (nsubj, obj, obl, nmod, appos). Different sets could be investigated, e.g. including acl and compound, which present a lower, but still considerable, amount of instances (see Figure 3).
Finally, while approaching RE by assuming that the gold entities are given is a common area of research, we leave the investigation of the proposed method on end-to-end RE for future work.

C Handling of Conj
In UD, the first element of a coordinated list governs all other elements of the list via a conj dependency and represents the list syntactically w.r.t. the remainder of the sentence. CrossRE (Bassignana and Plank, 2022) relations, on the other hand, directly link the two entities involved in the semantic structure. To account for this difference, we propagate the conjunction dependencies in order to reflect the semantic relations, as shown in Figure 7.
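The sketch below shows one plausible way to implement this propagation (the exact rules illustrated in Figure 7 may differ): every token attached via conj inherits the governor and label of the first conjunct, so that all coordinated entities become directly reachable on a dependency path.

```python
def propagate_conj(heads, deprels):
    """Copy the attachment of the first conjunct onto every further conjunct.

    heads   : list of 0-based head indices (-1 for the root)
    deprels : list of UD labels aligned with heads
    Returns extra (dependent, governor, label) edges to add to the tree.
    """
    extra_edges = []
    for dep, (head, rel) in enumerate(zip(heads, deprels)):
        # `head` points to the first conjunct; copy that conjunct's own
        # governor and relation onto the current token
        if rel == "conj" and head >= 0 and heads[head] >= 0:
            extra_edges.append((dep, heads[head], deprels[head]))
    return extra_edges
```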

D CrossRE Size
We report in Table 5 the dataset statistics of CrossRE (Bassignana and Plank, 2022), including the number of sentences and of relations.

E Syntax Pre-training Performance

Figure 1: Syntactic and Semantic Structures Affinity. Shortest dependency path (above), and semantic relation (below) between two semantic entities.
Example sentence from Figures 1 and 2: Linear-fractional programming (LFP) is a generalization of linear programming (LP).

Figure 2: Pre-training Example. Given the dependency tree (above), we filter in for pre-training only the UD labels which are on the shortest dependency path between two semantic entities (below).

Figure 3: UD Label Distribution Over the Shortest Dependency Paths. Statistics of the UD labels which are on the shortest dependency path between two entities over the six train sets of CrossRE (Bassignana and Plank, 2022).

Figure 4: Shortest Dependency Path Length. Statistics of the shortest dependency path length between two entities over the train sets of CrossRE (Bassignana and Plank, 2022).
Our RE model follows the current state-of-the-art architecture by Baldini Soares et al. (2019), which augments the sentence with four entity markers (e1^start, e1^end, e2^start, e2^end) before feeding it into a pre-trained encoder (BERT; Devlin et al., 2019). The classification is then made by a 1-layer feed-forward neural network over the concatenation of the representations of the two start markers. We run our experiments over five random seeds and report the average performance. See Appendix B for reproducibility and hyperparameter settings of our model. Training The training of our RE model is divided into two phases. In the first one, which we call syntax pre-training, we use the unlabeled corpora from CrossNER for pre-training the baseline model over the RE-relevant UD labels.
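As an illustration of the architecture described above, the following is a hedged sketch (not the authors' released implementation) of the entity-marker encoding and the 1-layer feed-forward classifier over the concatenated start-marker representations; the marker strings and module names are assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MarkerREModel(nn.Module):
    """Baldini Soares et al. (2019)-style RE classifier: the sentence is
    augmented with entity markers and classified from the concatenation of
    the two start-marker representations."""

    def __init__(self, num_labels, encoder_name="bert-base-cased"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        # Marker strings are an assumption; any four reserved tokens work.
        self.tokenizer.add_tokens(["[E1]", "[/E1]", "[E2]", "[/E2]"])
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.encoder.resize_token_embeddings(len(self.tokenizer))
        self.classifier = nn.Linear(2 * self.encoder.config.hidden_size, num_labels)

    def forward(self, marked_sentence):
        # assumes the sentence already contains the four markers
        enc = self.tokenizer(marked_sentence, return_tensors="pt")
        hidden = self.encoder(**enc).last_hidden_state[0]   # (seq_len, hidden)
        ids = enc["input_ids"][0].tolist()
        e1 = ids.index(self.tokenizer.convert_tokens_to_ids("[E1]"))
        e2 = ids.index(self.tokenizer.convert_tokens_to_ids("[E2]"))
        pair = torch.cat([hidden[e1], hidden[e2]], dim=-1)  # start markers only
        return self.classifier(pair)
```

A call such as `model("[E1] LFP [/E1] is a generalization of [E2] linear programming [/E2] .")` would then return logits over the relation label set (or, during syntax pre-training, over the UD label set).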

Figure 5: Pre-train Data Quantity Analysis. Average (dev) performance of the six models when pre-trained over increasing amounts of syntactic instances.

Figure 6: Per-Domain Pre-train Data Quantity Analysis. Individual (dev) performance of the model fine-tuned on AI when pre-trained over increasing amounts of syntactic instances.

Figure 8 reports the performance of the RE model during the syntax pre-training phase, over increasing amounts of pre-training dependency instances. The scores are computed on a set including 600 sentences (100 per domain) not overlapping with the train set used in the syntax pre-training phase.

Figure 8: Pre-train Performance. Pre-train performance of the RE model over increasing amounts of dependency instances.

Table 1: Performance Scores. Macro-F1 scores of the baseline model, compared with the proposed syntax pre-training approach, and, as comparison, with the traditional pre-training over the manually annotated SciERC dataset (Luan et al., 2018).

Table 2: UD Label Distribution Over the Shortest Dependency Paths per Relation Type. Statistics of the UD labels which are on the shortest dependency path between two entities, divided by the 17 relation types of CrossRE (Bassignana and Plank, 2022).

Table 3: Shortest Dependency Path Length per Relation Type. Statistics of the shortest dependency path length between two semantic entities, divided by the 17 relation types of CrossRE (Bassignana and Plank, 2022).

Table 4: Hyperparameter Settings. Model details for reproducibility of the baseline.