Language Model Analysis for Ontology Subsumption Inference

Investigating whether pre-trained language models (LMs) can function as knowledge bases (KBs) has raised wide research interest recently. However, existing works focus on simple, triple-based, relational KBs and omit more sophisticated, logic-based, conceptualised KBs such as OWL ontologies. To investigate an LM's knowledge of ontologies, we propose OntoLAMA, a set of inference-based probing tasks and datasets from ontology subsumption axioms involving both atomic and complex concepts. (An ontology concept is also known as a class; to avoid confusion with class in machine learning classification, we stick to the term concept.) We conduct extensive experiments on ontologies of different domains and scales, and our results demonstrate that LMs encode relatively less background knowledge of Subsumption Inference (SI) than traditional Natural Language Inference (NLI) but can improve on SI significantly when a small number of samples are given. We will open-source our code and datasets. (Code and instructions: https://krr-oxford.github.io/DeepOnto/ontolama; dataset at HuggingFace: https://huggingface.co/datasets/krr-oxford/OntoLAMA/ or at Zenodo: https://doi.org/10.5281/zenodo.6480540)


Introduction
The advancements of large pre-trained language models (LMs) have sparked research interest in investigating how much explicit semantics LMs can learn or infer from knowledge bases (KBs) (AlKhamissi et al., 2022). The LAMA (LAnguage Model Analysis) probe (Petroni et al., 2019) is among the first works that adopt prompt-based methods to simulate the process of querying relational knowledge from various KBs such as ConceptNet (Speer and Havasi, 2012) and GoogleRE (https://code.google.com/archive/p/relation-extraction-corpus/). Some subsequent studies focus on probing specific types of knowledge from sources like commonsense KBs (Da et al., 2021), biomedical KBs (Sung et al., 2021), temporal KBs (Dhingra et al., 2022), and cross-lingual KBs (Liu et al., 2021a). However, existing "LMs-as-KBs" works focus on simple, triple-based, relational KBs and neglect more formalised, logic-based, conceptualised KBs. For example, a statement like "London is the capital of the UK" can be expressed as the triple (London, capitalOf, UK); but a sentence like "arthritis is a kind of arthropathy with an inflammatory morphology", which describes the concept "arthritis", cannot be easily expressed using just triples. Conceptual knowledge like this requires a formal and expressive representation to be defined precisely. A well-known model for conceptual knowledge is the OWL ontology (Bechhofer et al., 2004; Grau et al., 2008), which can be seen as a description logic (DL) KB with rich built-in vocabularies for knowledge representation and support from various reasoning tools.
Taking the example of "arthritis", in DL the concept can be described as Arthritis ⊑ Arthropathy ⊓ ∃hasMorphology.Inflammatory.
In this work, we take a further step along the "LMs-as-KBs" research line towards more formalised semantics by targeting DL KBs and in particular the OWL ontologies. Current works on LMs concerning ontologies are mostly driven by a target application. Liu et al. (2020) and He et al. (2022) apply language model fine-tuning to address ontology curation tasks such as concept insertion and matching, while Ye et al. (2022) transform ontologies into graphs for data augmentation in few-shot learning. In contrast to these application-driven approaches, we investigate a more fundamental question: To what extent can LMs infer conceptual knowledge modelled by an ontology? Particularly, we focus on the subsumption relationships between ontology concepts. As shown in Figure 1, we first extract concept pairs (C, D) that are deemed as positive (C and D are in a subsumption relationship) and negative (C and D are assumed to be disjoint) samples from an ontology. Note that the sampling procedure is fully automatic, with the syntax and semantics of the OWL ontology carefully considered. To translate the concepts, and especially those with complex logical expressions, into natural language texts, we develop a recursive concept verbaliser. We formulate the Subsumption Inference (SI) task similarly to the Natural Language Inference (NLI) task and treat the concept pairs as premise-hypothesis pairs (Padó and Dagan, 2022), which will then be wrapped into a template for generating inputs of LMs.
We have created SI datasets from ontologies of various domains and scales, and conducted extensive experiments. Our results demonstrate that LMs perform better on a typical NLI task than on the constructed SI tasks under the zero-shot setting, indicating that LMs encode relatively less background knowledge of ontology subsumptions. However, when a small number of samples are provided (K-shot settings), the performance on SI improves significantly. This observation is consistent across the three LMs studied in this work.

OWL Ontology
An OWL ontology is a description logic (DL) knowledge base that consists of TBox (terminological), ABox (assertional), and RBox (relational) axioms (Krötzsch et al., 2012). In this work, we focus on the TBox axioms which specify the subsumption relationships between concepts of a domain. A subsumption axiom has the form C ⊑ D, where C and D are concept expressions involving atomic concepts, negation (¬), conjunction (⊓), disjunction (⊔), existential restriction (∃r.C), universal restriction (∀r.C), and so on (see the complete definition in Appendix A). An atomic concept is a named concept, the top concept ⊤ (a concept with every individual as an instance), or the bottom concept ⊥ (an empty concept); a complex concept involves at least one of the available logical operators. An equivalence axiom C ≡ D is equivalent to C ⊑ D and D ⊑ C.
Regarding the semantics, in DL we define an interpretation I = (Δ^I, ·^I) that consists of a nonempty set Δ^I and a function ·^I that maps each concept C to C^I ⊆ Δ^I and each property r to r^I ⊆ Δ^I × Δ^I. We say I is a model of C ⊑ D if C^I ⊆ D^I holds, and I is a model of an ontology O if I is a model of all axioms in O. If C^I ⊆ D^I holds for every model I of O, then we say O |= C ⊑ D. This defines logical entailment w.r.t. an ontology, which is more strictly defined than textual entailment based on human beliefs.
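The set-based semantics above can be sketched in a few lines; the toy interpretation below (domain elements and concept extensions are made up for illustration) checks C^I ⊆ D^I for a single interpretation:

```python
# Toy set-based interpretation: Δ^I is `domain`, ·^I is `interp`.
domain = {"a1", "a2", "a3"}
interp = {
    "Arthritis":   {"a1"},        # Arthritis^I
    "Arthropathy": {"a1", "a2"},  # Arthropathy^I
}

def models_subsumption(interp, c, d):
    """I is a model of C ⊑ D iff C^I ⊆ D^I."""
    return interp[c] <= interp[d]

print(models_subsumption(interp, "Arthritis", "Arthropathy"))  # True
```

Entailment O |= C ⊑ D would then require this check to succeed in every model of O, not just one.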
An individual a is an instance of a concept C in I if a^I ∈ C^I. The disjointness of two concepts C and D can be expressed by the axiom C ⊓ D ⊑ ⊥, which means there can be no common instance a of C and D.
The Open World Assumption (OWA) underpins OWL ontologies, according to which what is not entailed by the ontology cannot be assumed to be false. For example, if we have an ontology that contains just one axiom Paella ⊑ ∃hasIngredient.Chicken, under the OWA we cannot determine whether paella can have chorizo as an ingredient or not. To allow reuse and extension, ontologies are often (intentionally) underspecified (Cimiano and Reyle, 2003); this characteristic motivates how we define the negative samples in Section 3.1.

Related Work
Recently, the rise of the prompt learning paradigm has shed light on better usage of pre-trained LMs with minor or no supervision. However, LMs are typically pre-trained in a stochastic manner, making it challenging to study what knowledge LMs have implicitly encoded (Petroni et al., 2019) and how to access LMs in an optimal or controllable way (Gao et al., 2021). Our work is informed by the "LMs-as-KBs" literature (AlKhamissi et al., 2022), where different probes have been designed to test LMs' knowledge of relational data. In Petroni et al. (2019), the probing of world knowledge is formulated as a cloze-style answering task where LMs are required to fill in the <MASK> token given input texts wrapped into a manually designed template. Sung et al. (2021) did similar work but shifted the focus to (biomedical) domain knowledge of domain-specific LMs. Liu et al. (2021a) pre-trained LMs with multi-lingual knowledge graphs (KGs) and tested them on cross-lingual tasks. Dhingra et al. (2022) proposed datasets with temporal signals and probed LMs on them with templates generated by the text-to-text transformer T5 (Raffel et al., 2022).
However, existing "LMs-as-KBs" works mostly focus on relational facts and omit logical semantics and conceptual knowledge. In contrast, our work focuses on OWL ontologies, which represent conceptual knowledge with an underlying logical formalism. Although there are some recent works concerning both LMs and ontologies, they do not compare them at the semantic level but rather emphasise downstream applications. For example, He et al. (2022) adopted LMs as synonym classifiers to predict mappings between ontologies, whereas Ye et al. (2022) used ontologies to provide extra context to help LMs make predictions.

Task Definition
Recalling the definitions in Section 2.1, a subsumption axiom C ⊑ D can be interpreted as: "every instance of C is an instance of D". We can accordingly form a premise-hypothesis pair where the premise is "x is a C" and the hypothesis is "x is a D" for some individual x. Note that there are different ways to express the premise and hypothesis, and we adopt a simple but effective one (see Section 5.1). Next, an ontology verbaliser is required to transform the concept expressions C and D into natural language texts. Analogous to Natural Language Inference (NLI) or Recognising Textual Entailment (RTE) (Poliak, 2020; Padó and Dagan, 2022), the task of Subsumption Inference (SI) is thus defined as classifying whether the premise entails the hypothesis. Note that SI is similar to a two-way RTE task where we do not consider the neutral class.
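As a rough illustration of this premise-hypothesis formulation (the "x is <A> ..." wording and the article rule follow Section 5.1; this is a sketch, not the exact template code):

```python
def article(noun):
    # "an" before a vowel; blank before "something" (Section 5.1 footnote rule)
    if noun == "something":
        return ""
    return "an " if noun[0].lower() in "aeiou" else "a "

def make_premise_hypothesis(c_text, d_text):
    # premise "x is a C", hypothesis "x is a D" for some individual x
    return (f"x is {article(c_text)}{c_text}",
            f"x is {article(d_text)}{d_text}")

print(make_premise_hypothesis("disease", "arthropathy"))
# ("x is a disease", "x is an arthropathy")
```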
Given an ontology O, we extract positive and negative subsumptions to probe LMs. The positive samples are concept pairs (C, D) with O |= C ⊑ D. Due to the OWA, we cannot determine whether a pair (C, D) with O ⊭ C ⊑ D really forms a negative subsumption (see Appendix F for more explanation); to generate plausible negative samples, we propose the assumed disjointness defined as follows:

Definition (Assumed Disjointness). If two concepts C and D are satisfiable in O ∪ {C ⊓ D ⊑ ⊥} and there is no named atomic concept A in O such that O |= A ⊑ C and O |= A ⊑ D, then C and D are assumed to be disjoint.
The first condition ensures that C and D are still satisfiable after adding the disjointness axiom for them into O whereas the second condition ensures that C and D have no common descendants because otherwise the disjointness axiom will make any common descendant unsatisfiable. If two concepts C and D satisfy these two conditions, we treat (C, D) as a valid negative subsumption.
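A minimal sketch of this negative-sample test, assuming the entailed descendants of each concept have been precomputed (e.g., by an OWL reasoner); the `subs` map and concept names are illustrative, and the satisfiability condition is approximated by the no-subsumption check as in Appendix E:

```python
# `subs[x]`: every named concept y with O |= y ⊑ x (precomputed descendants).
subs = {
    "Seed":  {"SunflowerSeed"},
    "Fruit": {"Apple"},
    "Food":  {"Seed", "Fruit", "SunflowerSeed", "Apple"},
}

def assumed_disjoint(c, d, subs):
    # no subsumption relationship in either direction
    if c in subs[d] or d in subs[c]:
        return False
    # no common named descendant, which would become
    # unsatisfiable after adding C ⊓ D ⊑ ⊥
    return not (subs[c] & subs[d])

print(assumed_disjoint("Seed", "Fruit", subs))  # True
print(assumed_disjoint("Seed", "Food", subs))   # False: Seed ⊑ Food
```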
However, in practice validating the satisfiability of each concept pair (C, D) would be inefficient, especially when the ontology is large and complex. Thus, we propose a pragmatic alternative to the satisfiability check in Appendix E.
To extract entailed positive subsumptions and validate sampled negative subsumptions by reasoning, we need an OWL reasoner that is proven sound and complete, e.g., HermiT (Glimm et al., 2014).
In the following sub-sections, we propose two specific SI tasks and their respective subsumption sampling methods.

Atomic Subsumption Inference
The first task aims at subsumption axioms that involve just named atomic concepts. Such axioms are usually the most prevalent in an ontology and can be easily verbalised by using the concept names. In this work, we use labels (in English) defined by the built-in annotation property rdfs:label as concept names.
The positive samples are extracted from all entailed subsumption axioms of the target ontology. We consider two types of negative samples: (i) soft negatives composed of two random concepts, and (ii) hard negatives composed of two random sibling concepts. Two sibling concepts lead to a "hard" negative sample because they share a common parent (thus having closer semantics) but are often disjoint. The sampled pairs need to meet the assumed disjointness defined in Section 3.1 to be accepted as valid negatives. We first sample equal numbers of soft and hard negatives and then randomly truncate the resulting set to the size of the positive sample set to keep class balance.

Table 1: Recursive rules for verbalising a complex concept expression C in OWL ontologies. Note that each C_i in the conjunction/disjunction pattern is also an arbitrary complex concept.

Pattern → Verbalisation V(·):
- A (atomic): the name (rdfs:label) of A
- r (property): the name (rdfs:label) of r, subject to the rules in Appendix C
- ¬C: "not V(C)"
- C1 ⊓ ... ⊓ Cn: if Ci = ∃/∀r.Di and Cj = ∃/∀r.Dj share the same quantifier and property, they are first rewritten into ∃/∀r.(Di ⊓ Dj); suppose after rewriting the new expression is C'1 ⊓ ... ⊓ C'n':
  (a) if all C'i (for i = 1, ..., n') are restrictions, in the form ∃/∀ri.Di: "something that V(r1) some/only V(D1) and ... and V(rn') some/only V(Dn')";
  (b) if C'1, ..., C'm are not restrictions while C'(m+1), ..., C'n' are restrictions: "V(C'1) and ... and V(C'm) that V(r(m+1)) some/only V(D(m+1)) and ... and V(rn') some/only V(Dn')";
  (c) if no C'i is a restriction: "V(C'1) and ... and V(C'n')"
- C1 ⊔ ... ⊔ Cn: similar to verbalising C1 ⊓ ... ⊓ Cn except that "and" is replaced by "or" and case (b) uses the same verbalisation as case (c)
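The negative sampling procedure for Atomic SI can be sketched as follows; `concepts`, `siblings`, and the `is_assumed_disjoint` callback are stand-ins for what a real pipeline would derive from the parsed ontology and the reasoner:

```python
import random

def sample_negatives(concepts, siblings, n_pos, is_assumed_disjoint, rng):
    soft, hard = [], []
    while len(soft) < n_pos // 2:            # (i) soft: two random concepts
        c, d = rng.sample(concepts, 2)
        if is_assumed_disjoint(c, d):
            soft.append((c, d))
    while len(hard) < n_pos // 2:            # (ii) hard: two random siblings
        c = rng.choice(concepts)
        sibs = siblings.get(c, [])
        if sibs:
            d = rng.choice(sibs)
            if is_assumed_disjoint(c, d):
                hard.append((c, d))
    negatives = soft + hard
    rng.shuffle(negatives)
    return negatives[:n_pos]                 # truncate to match the positives
```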

Complex Subsumption Inference
In the second SI task, we consider subsumption axioms that involve complex concepts. Particularly, we choose equivalence axioms of the form A ≡ C (where A and C are an atomic and a complex concept, respectively) as anchors, and equivalently transform them into subsumption axioms of the forms A ⊑ C and C ⊑ A, through which complex concepts can appear on both the premise and hypothesis sides. Equivalence axioms of this form are referred to as definitions of the named concept and are common in OWL.

Recursive Concept Verbaliser To transform a complex concept C into a natural language text, we develop the recursive concept verbaliser consisting of a syntax tree parser and a set of recursive rules (see Table 1). A concrete example is shown in Figure 2. The leaf nodes are either atomic concepts or properties, and they are verbalised by their names. At each recursive step, verbalised child nodes are merged according to the logical pattern in their parent node. Note that we enforce some extent of redundancy removal for the conjunction (⊓) and disjunction (⊔) patterns. Taking the example in Figure 2, the verbalised atomic concept "meat" is placed before "that" as an antecedent, and the verbalised conjunction of two restrictions is placed after "that" as a relative clause. "meat" would be replaced by "something" if the concept Meat were not involved. Moreover, if two restrictions with the same quantifier and property are connected by ⊓ or ⊔, they will be merged into one restriction. For example, ∃derivesFrom.Cattle ⊓ ∃derivesFrom.Sheep will be transformed into ∃derivesFrom.(Cattle ⊓ Sheep).

We extract equivalence axioms of the form A ≡ C from the target ontology. Taking each such axiom as an anchor, we can obtain positive complex subsumption axioms of the form A_sub ⊑ C or C ⊑ A_super, where A_sub and A_super are a sub-class and a super-class of A, respectively. To derive challenging negative samples, we first randomly replace a named concept in C with another named concept, obtaining a corrupted concept C′. Without loss of generality, we assume the random replacement leads to case (ii). We then check whether A and C′ satisfy the assumed disjointness described in Section 3.1. In the affirmative case, we can have either A ⊑ C′ or C′ ⊑ A as the final negative subsumption; otherwise, we skip this sample. For example, given SunflowerSeed ≡ Seed ⊓ ∃derivesFrom.HelianthusAnnuus, a possible negative subsumption is SunflowerSeed ⊑ Fruit ⊓ ∃derivesFrom.HelianthusAnnuus if Seed in C is replaced by Fruit to create C′.
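To make the recursion concrete, here is a heavily simplified verbaliser over a tiny tuple-based syntax tree; it implements only a few of the Table 1 cases, and the labels and expression encoding are made up for illustration:

```python
def verbalise(expr, labels):
    op = expr[0]
    if op == "atom":
        return labels[expr[1]]
    if op == "not":                      # ¬C → "not V(C)"
        return "not " + verbalise(expr[1], labels)
    if op == "some":                     # ∃r.D → "V(r) some V(D)"
        _, r, filler = expr
        return labels[r] + " some " + verbalise(filler, labels)
    if op == "and":                      # conjunction, Table 1 cases (a)/(b)/(c)
        atoms = [a for a in expr[1:] if a[0] == "atom"]
        rests = [a for a in expr[1:] if a[0] != "atom"]
        head = " and ".join(verbalise(a, labels) for a in atoms) or "something"
        if rests:
            head += " that " + " and ".join(verbalise(r, labels) for r in rests)
        return head
    raise ValueError(op)

labels = {"Meat": "meat", "derivesFrom": "derives from", "Cattle": "cattle"}
expr = ("and", ("atom", "Meat"), ("some", "derivesFrom", ("atom", "Cattle")))
print(verbalise(expr, labels))  # "meat that derives from some cattle"
```

The real verbaliser additionally parses OWL syntax and performs the quantifier-merging rewrite before recursing.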

Datasets
In this work, we consider ontologies of different domains and scales, including:
• Schema.org (https://schema.org/, released on 2022-03-17): a general-purpose ontology that maintains a basic schema for structured data on the Web;
• DOID (https://disease-ontology.org/, released on 2022-09-29): an ontology for human diseases (Schriml et al., 2012);
• FoodOn (https://foodon.org/, released on 2022-08-12): an ontology specialised in food-related knowledge including food products, food sources, food nutrition, and so on (Dooley et al., 2018);
• GO (http://geneontology.org/, released on 2022-11-03): a very fine-grained and widely used biomedical ontology specialised in genes and gene functions (Ashburner et al., 2000).

We used the most up-to-date versions at the time of the experiments. The details of pre-processing the ontologies are given in Appendix B. We construct an Atomic SI dataset for each ontology, but Complex SI datasets are created for FoodOn and GO only, due to their abundance of equivalence axioms. To avoid too many repetitive concept expressions brought by a particular equivalence axiom, we sample at most 4 positive and 4 negative samples for each equivalence axiom in the Complex SI setting. To attain class balance, we purposely keep the number of negative samples the same as that of positive samples in each data split. For most of the resulting datasets, we divide each into 8 : 1 : 1 for training, development, and testing; for Schema.org's Atomic SI and FoodOn's Complex SI datasets, which are relatively smaller, we apply a 2 : 1 : 7 division instead. Note that we mainly focus on K-shot settings in the probing study, thus the required training and development sample sets are small.
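The class-balanced splitting can be sketched as below (a toy version, not the actual dataset scripts; the 2 : 1 : 7 division is the same idea with different `ratios`):

```python
import random

def balanced_split(pos, neg, rng, ratios=(0.8, 0.1, 0.1)):
    """Split positives and negatives separately so each part stays balanced."""
    parts = ([], [], [])  # train, dev, test
    for samples in (pos, neg):
        samples = list(samples)
        rng.shuffle(samples)
        n = len(samples)
        a = int(n * ratios[0])
        b = a + int(n * ratios[1])
        for part, chunk in zip(parts, (samples[:a], samples[a:b], samples[b:])):
            part.extend(chunk)
    return parts
```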
To compare with how LMs perform on traditional NLI, we additionally create biMNLI, a subset of the Multi-Genre Natural Language Inference (MNLI) corpus (Williams et al., 2018) where (i) the neutral class and its samples are removed, (ii) the Matched and Mismatched testing sets are merged into one testing set, (iii) 10% of the training data is used as the development set, and (iv) the entailment-contradiction ratio is set to 1 : 1 (by discarding extra samples from the dominant class) for a balanced prior. The numbers of named concepts and equivalence axioms in the ontologies, and the numbers of samples in (each split of) the SI datasets and the biMNLI dataset, are reported in Table 2.
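Steps (i) and (iv) of the biMNLI construction can be sketched as follows (record fields are illustrative stand-ins, not the MNLI schema):

```python
import random

def build_bimnli(records, rng):
    # (i) drop the neutral class and its samples
    by_label = {"entailment": [], "contradiction": []}
    for r in records:
        if r["label"] in by_label:
            by_label[r["label"]].append(r)
    # (iv) 1:1 ratio by discarding extra samples from the dominant class
    k = min(len(v) for v in by_label.values())
    balanced = []
    for v in by_label.values():
        rng.shuffle(v)
        balanced.extend(v[:k])
    return balanced
```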

Prompt-based Inference
To conduct the inference task under prompt-based settings, we wrap the verbalised subsumption axioms and the <MASK> token into a template to serve as inputs of LMs. We opt to use different combinations of manually designed templates (T1 and T2) and label words (L1 to L3) that have achieved promising results on NLI tasks (Schick and Schütze, 2021; Gao et al., 2021), where <A> is "a", "an", or just blank depending on the next word, V(·) is the concept verbalisation function defined in Section 3, and <MASK> is the token that LMs need to predict. The probability of predicting class y ("positive" or "negative") for an input sample x = (C, D) is defined as

    P(y | x) = Σ_{v ∈ Lj[y]} exp(w_v · h_<MASK>) / Σ_{w ∈ Lj[·]} exp(w_w · h_<MASK>)

where Lj[·] and Lj[y] denote all the label words defined in Lj and the label words of class y defined in Lj, respectively; Ti(C, D) denotes the texts of concepts C and D transformed through the template Ti; w_v and w_w are the vectors for the label words v and w, respectively; and h_<MASK> denotes the hidden vector of the masked token. The prediction can be trained by minimising the cross-entropy loss.
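The probability above is a softmax over label-word logits summed per class; a minimal sketch with made-up label words and logit values:

```python
import math

label_words = {"positive": ["yes"], "negative": ["no"]}  # an illustrative L_j
logits = {"yes": 2.0, "no": 0.0}                         # w_v · h_<MASK>, made up

def class_probability(logits, label_words, y):
    num = sum(math.exp(logits[v]) for v in label_words[y])
    den = sum(math.exp(logits[v]) for ws in label_words.values() for v in ws)
    return num / den

print(round(class_probability(logits, label_words, "positive"), 3))  # 0.881
```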
For the biMNLI dataset, the premise and hypothesis are replaced by what were originally given in the dataset -except that we have removed trailing punctuations.
In the main experiments concerning language models, we consider all the combinations of Ti and Lj, and additionally consider 3 random seeds (thus 18 experiments each) for K-shot settings where K > 0. The value of K refers to the number of samples per classification label (positive or negative) that we randomly extract from the training and development sets, respectively. For K = 0 (zero-shot), different random seeds do not affect the results. For the fully supervised setting, we consider only one random seed and one combination (T1 and L1) because our pilot experiments demonstrate that fine-tuning on large samples results in low variance across different random seeds and different combinations of templates and label words.

We make slight modifications to the templates by adding the prefix "It/it is <A>" to make the premise and hypothesis sentences complete. "an" is used when the next word starts with a vowel; <A> is left blank when the next word is "something".
Our code implementation mainly relies on the OWL API for ontology processing and reasoning, and on OpenPrompt for prompt learning (Ding et al., 2022). Training of each K-shot (where K > 0) experiment takes 10 epochs, while for the fully supervised setting involving very large training samples, we only train for 1 epoch. The best-performing model on the development set (at each epoch) is selected for inference on the testing set. We use the AdamW optimiser (Loshchilov and Hutter, 2019) with the initial learning rate, weight decay, and number of warm-up steps set to 10^-5, 10^-2, and 50, respectively. All our experiments are conducted on two Quadro RTX 8000 GPUs.
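The stated warm-up setting implies a schedule roughly like the sketch below (the linear warm-up shape and flat tail are assumptions; the text only gives the three hyperparameter values):

```python
BASE_LR, WEIGHT_DECAY, WARMUP_STEPS = 1e-5, 1e-2, 50  # values from the text

def lr_at(step):
    """Learning rate under a linear warm-up, flat afterwards (assumed shape)."""
    if step < WARMUP_STEPS:
        return BASE_LR * (step + 1) / WARMUP_STEPS
    return BASE_LR
```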

Results and Analysis
LMs and Settings We choose LMs from the RoBERTa family (Liu et al., 2019) as they are frequently used in cloze-style probing tasks (Liu et al., 2021b; Sung et al., 2021; Kavumba et al., 2022). In Table 3, we present key experiment results for roberta-large and roberta-base; a further ablation study for a biomedical variant of roberta-large follows in a later paragraph.
For both LMs in Table 3, we report results of K-shot settings with K ∈ {0, 4, 32, 128}. We additionally present the results of the fully supervised setting for roberta-large as the oracle. For each setting, we report the averaged accuracy and standard deviation (where applicable). To clearly observe how the performance varies as K increases, we present Figure 3 which visualises the K-shot results for roberta-large with additional values of K ({8, 16, 64}). The complete result table for both language models and the figure that visualises the performance of roberta-base are available in Appendix G.
Baselines As mentioned above, we purposely keep class balance in each data split, thus the accuracy scores for the majority-vote baseline are all 50.0%.

Table 3: Results for the biMNLI, Atomic SI, and Complex SI tasks, with each cell stating "mean accuracy (standard deviation)" except for the majority vote and fully supervised settings where standard deviation is not available.

Besides, we consider word2vec (Mikolov et al., 2013) pre-trained on GoogleNews (https://code.google.com/archive/p/word2vec/) with a logistic regression classifier as a baseline model, which demonstrates how a classic non-contextual word embedding model performs on the SI tasks. For this baseline, we only report results for K ∈ {4, 128} as the increase of K does not bring significant change, and the results at K = 128 are roughly comparable to the results of roberta-large at K = 4. This suggests that the SI sample patterns are not easily captured with word2vec.
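The baseline's feature step can be sketched as averaging the word vectors of each side of the pair and concatenating them for the classifier (tiny made-up vectors stand in for the 300-dimensional GoogleNews embeddings):

```python
DIM = 3  # GoogleNews vectors are 300-d; 3-d toy vectors here

vectors = {  # made-up stand-ins for word2vec embeddings
    "arthritis":   [1.0, 0.0, 0.0],
    "arthropathy": [0.8, 0.2, 0.0],
}

def avg_vector(text, vectors, dim=DIM):
    """Average the vectors of in-vocabulary words; zeros if none are known."""
    vecs = [vectors[w] for w in text.lower().split() if w in vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# concatenate premise and hypothesis vectors as the classifier's features
features = avg_vector("arthritis", vectors) + avg_vector("arthropathy", vectors)
print(len(features))  # 6
```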
SI vs biMNLI From the results, we first observe that both roberta-large and roberta-base achieve better zero-shot results on biMNLI than on all the SI datasets, by at least 7.0% and 6.1% respectively, showing that under our prompt settings, both LMs encode more background knowledge relevant to biMNLI than to SI. However, as K grows, the performance on both biMNLI and SI improves consistently and significantly (while the standard deviation generally reduces), and we can see that at K = 32, the mean accuracy scores on the Atomic SI tasks have surpassed biMNLI for roberta-base. At K = 64 (see Figure 3), the mean accuracy scores on biMNLI and all the Atomic SI tasks converge to around 90.0%; the scores on the two Complex SI tasks are also above 80.0% for both LMs. Moreover, roberta-large consistently attains a better score than roberta-base in every setting. The zero-shot scores are almost the same as majority vote on the Complex SI dataset of GO; at K = 128, roberta-large attains 88.4% on the Complex SI dataset of FoodOn while it attains more than 90% on the others. We can also observe from Figure 3 that the scores on the Complex SI tasks are generally lower than those on the Atomic SI tasks. Among the Atomic SI tasks, we find GO the most challenging, which is as expected because GO is a fine-grained, expert-level ontology. However, it surprises us that at K = 32 the score on DOID (92.3%) is better than on all other tasks, considering that DOID is a domain-specific ontology.
Domain-specific SI We conduct further experiments for domain-specific LMs on domain-specific SI tasks. Specifically, we consider the variant roberta-large-pm-m3-voc, which has been pre-trained on the biomedical corpora PubMed abstracts, PMC full-text, and MIMIC-III clinical notes, with an updated sub-word vocabulary learnt from PubMed (Lewis et al., 2020). In Table 4, we present the K-shot results of roberta-large-pm-m3-voc on three SI tasks related to the biomedical ontologies DOID and GO. The zero-shot scores are almost equivalent to majority vote, but the performance improves more prominently than roberta-large on the Atomic SI tasks of DOID and GO as K increases. Surprisingly, the Complex SI setting of GO seems to be quite challenging for this biomedical variant of RoBERTa; for example, at K = 4, the score is not improved compared to K = 0.
Template and Label Words How LMs are accessed is an influential factor in performance, especially when there are few or no training samples. For example, roberta-large attains a standard deviation of 8.8% for K = 0 on FoodOn's Atomic SI task, suggesting a significant performance fluctuation brought by different combinations of templates and label words. Although the standard deviation on GO's Complex SI is just 0.6%, the corresponding accuracy score (50.4%) indicates that none of these combinations work. Furthermore, effective templates or label words are not transferable from one LM to another, as we can observe from the poor performance of roberta-large-pm-m3-voc for K = 0 on the SI tasks of biomedical ontologies. These observations suggest that either we did not find a generalised combination of template and label words, or LMs require customised access for different types of knowledge.

Conclusion and Discussion
As a work that introduces ontologies to the "LMs-as-KBs" collection, this paper emphasises how to establish a meaningful adaptation from logical expressions to natural language expressions, following their formal semantics. To this end, we leverage the Natural Language Inference (NLI) setting to define the Subsumption Inference (SI) task, with careful consideration of the differences between textual entailment and logical entailment. We also develop the recursive concept verbaliser for OWL ontologies as an auxiliary tool.
Our results demonstrate that with our SI set-ups, LMs can successfully learn to infer both atomic and complex subsumptions when a small number of annotated samples are provided. This paves the way for investigating more complex reasoning tasks with LMs or guiding LMs using ontology semantics with limited training.
In fact, the current SI setting is not the only way to probe the subsumption knowledge of an ontology; for example, we could directly verbalise C ⊑ D as "V(C) is a kind of V(D)" and formulate the probing task similarly to fact-checking or, equivalently, as an inference task with empty premises. However, our pilot experiments demonstrate that such a setting is not as effective as the current SI setting.
The presented work brings opportunities for future work: (i) the proposed ontology verbalisation method does not yet cover all possible patterns of complex concepts (e.g., those with cardinality restrictions and nominals); (ii) we have not fully considered textual information such as synonyms, definitions, and comments that are potentially available in an ontology; (iii) we have considered only TBox (terminological) axioms, but ABox (assertional) axioms could be involved in, e.g., a membership prediction task, where the objective is to classify which concept an individual belongs to. Therefore, developing a robust tool for verbalising logical expressions and extending the ontology inference settings are potential next tasks. Another interesting line for the near future is to train an LM using ontologies with their logical semantics considered. The resulting LM is expected to be applicable to different downstream ontology curation tasks such as ontology matching and entity linking, with fewer samples necessary for fine-tuning.

Limitations
As we mainly focus on conceptual knowledge captured in so-called TBox (terminological) axioms, ABox (assertional) axioms are not considered. ABox axioms can capture situations of specific individuals (e.g., the health status of a person) which could cause privacy issues, and we would not expect LMs to capture such knowledge. Hence, dealing with ABox axioms could require additional engineering for data preprocessing.

Ethical Considerations
In this work, we construct new datasets for the proposed Subsumption Inference (SI) task from publicly available ontologies: Schema.org, DOID, FoodOn, and GO, with their download links specified in Section 4. The biMNLI dataset is constructed from the existing open-source MNLI dataset. We have confirmed that there is no privacy or license issue in all these datasets.

B Ontology Preprocessing
Since some of the ontologies used in this work contain concepts that are meaningless for subsumption sampling (e.g., obsolete concepts) and/or concept names (or aliases) that are apparently unnatural, we apply a general preprocessing procedure to all the ontologies and then conduct individual preprocessing for each ontology.

General Preprocessing
• Remove obsolete concepts (which are indicated by the built-in annotation property owl:deprecated) and apparently redundant concepts such as foodOn:stupidType. • Use rdfs:label as the main annotation property to extract concept names except when its literal value is not available. The extracted concept names are lower-cased and any underscores "_" in them are removed.

Individual Preprocessing
• Schema.org: concept names defined in this ontology are in the Java-identifier style; thus, they are parsed into natural expressions, e.g., "APIReference" to "API Reference".
• DOID: remove the concept doid:Disease because it is a general concept just below the root concept owl:Thing, which would otherwise lead to uninformative samples.
• FoodOn: concept names are cleaned with the pattern [0-9]+ -(.*) \(.+\), followed by removal of leading and trailing whitespaces. Note that concepts in this ontology sometimes have an empty literal given by rdfs:label; in these cases, the annotation properties obo:hasSynonym and obo:hasExactSynonym are used instead.
• GO: no individual processing.

C Object Property Verbalisation
Different from verbalising an atomic concept, where we simply use its name (or alias), we enforce some simple rules to verbalise an object property for a basic grammar fix. If the property name starts with a passive verb, adjective, or noun, we prepend "is". For example, "characteristic of" is changed to "is characteristic of"; "realised in" is changed to "is realised in". Note that the word's part-of-speech tag is automatically determined using the Python library spaCy.
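A sketch of this rule, with a tiny hand-made POS lookup standing in for the spaCy tagger used in the paper:

```python
# Hand-made POS lookup (an assumption for illustration; the paper uses spaCy).
POS = {"characteristic": "NOUN", "realised": "VERB_PASSIVE", "derives": "VERB"}

def verbalise_property(name):
    """Prepend "is" when the first word is a passive verb, adjective, or noun."""
    first = name.split()[0]
    if POS.get(first) in ("VERB_PASSIVE", "ADJ", "NOUN"):
        return "is " + name
    return name

print(verbalise_property("characteristic of"))  # "is characteristic of"
print(verbalise_property("derives from"))       # "derives from"
```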

D Complex Concept Verbalisation Examples
For a clearer understanding of how our verbalisation approach works, we present some typical examples of verbalised concepts from the constructed Complex SI datasets in Table 6.

Table 6: Examples of verbalised complex concepts from the Complex SI datasets.
- BioRegulation ⊓ ∃negRegulate.ProlineBiosynProc: "biological regulation that negatively regulates some proline biosynthetic process"
- ApoptoticProc ⊓ ∃partOf.Luteolysis: "apoptotic process that is part of some luteolysis"
- ConcnOf ⊓ ∃charOf.(Fucose ⊓ ∃partOf.MaterialEnt): "concentration of something that is characteristic of some fucose that is part of some material entity"
- ∃derivesFrom.(TimothyPlant ⊔ TrifoliumPratense) ⊓ PlantFoodProd ⊓ Silage: "silage and plant food product that derives from some timothy plant or trifolium pratense"
- Apple ⊓ ¬∃hasPart.ApplePeel: "apple (whole or parts) and not something that has part some apple peel"

E Implementation Choices for Assumed Disjointness
As mentioned in Section 3.1, validating the disjointness axiom for each concept pair (C, D) sampled as a potential negative subsumption would be time-consuming, because we would need to iteratively add the disjointness axiom into the ontology O, conduct reasoning, and remove the axiom afterwards. Therefore, in practice we use the following conditions to replace the satisfiability check: (i) C and D have no subsumption relationship, i.e., O ⊭ C ⊑ D and O ⊭ D ⊑ C; (ii) C and D have no common named descendant; and (iii) C and D have no common instance. If C and D satisfy these conditions, they are likely to be satisfiable after adding the disjointness axiom C ⊓ D ⊑ ⊥ into O. Since these conditions involve no extra reasoning for a new axiom, they are much more efficient than iteratively conducting a satisfiability check for each candidate. It is important to notice that we still need the no-common-descendant check to prevent foreseeable unsatisfiability: if a named atomic concept A is an inferred sub-class (i.e., descendant) of both C and D, then C and D may be satisfiable in O ∪ {C ⊓ D ⊑ ⊥}, but A is certainly unsatisfiable (equivalent to ⊥).

Figure 4: Set-based semantics for relationships between two ontology concepts.

F Set-based Interpretations of Subsumption Samples
In this section, we provide more explanation of how we define positive and negative samples in the Subsumption Inference (SI) task. Recalling the definitions in Section 2.1, an ontology O entails a subsumption axiom C ⊑ D if it holds in every model I of O. In terms of set-based semantics, this refers to case (a) in Figure 4. In the (b), (c), or (d) cases, there exists at least one interpretation I in which we can find an individual x such that x^I ∈ C^I and x^I ∉ D^I; hence O does not entail the subsumption axiom C ⊑ D. Non-subsumption is entailed only when (a) fails in every model of O.
Disjointness corresponds to case (c) in Figure 4, where the extensions of C and D have no overlap in every interpretation. The non-subsumptions an ontology typically entails come from disjointness axioms (but disjointness, ∀x.(C(x) → ¬D(x)), is stricter than non-subsumption, ∃x.(C(x) ∧ ¬D(x))). Nevertheless, ontologies are typically underspecified in terms of disjointness, and thus getting enough negative samples this way is infeasible. To find a middle ground, it is reasonable to adopt heuristics; the assumed disjointness we follow in Section 3.1 in the main body of the paper serves this purpose.

Table 7: Full results of roberta-base and roberta-large on the biMNLI, Atomic SI, and Complex SI tasks, with each cell stating "mean accuracy (standard deviation)" except for the majority vote and fully supervised settings where standard deviation is not available.
In the ideal setting, where we check the satisfiability of C and D after adding the disjointness axiom and check that C and D have no common descendant, cases (a) and (b) in Figure 4 are prevented and the chance of (d) is reduced. Even with the practical alternative proposed in Appendix E, the no-subsumption-relationship condition also ensures that (a) and (b) are not entailed, and the no-common-descendant and no-common-instance conditions reduce the chance of (d). Thus, assumed disjointness is a reasonable approach to approximating non-subsumptions.

G Complementary Results and Figures
In the main body of the paper, we report partial results (accuracy scores and standard deviations) of roberta-large and roberta-base for K ∈ {0, 4, 32, 128}. In Table 7, we present full results of both LMs for K ∈ {0, 4, 8, 16, 32, 64, 128}. Besides, we provide the visualisation of K-shot results for roberta-base in Figure 5. The observations are consistent with those for roberta-large in Figure 3.