MIST: a Large-Scale Annotated Resource and Neural Models for Functions of Modal Verbs in English Scientific Text

Modal verbs (e.g., "can", "should", or "must") are highly frequent in scientific articles. Decoding their function is not straightforward: they are often used for hedging, but they may also denote abilities and restrictions. Understanding their meaning is important for various NLP tasks such as writing assistance or accurate information extraction from scientific text. To foster research on the usage of modals in this genre, we introduce the MIST (Modals In Scientific Text) dataset, which contains 3737 modal instances in five scientific domains annotated for their semantic, pragmatic, or rhetorical function. We systematically evaluate a set of competitive neural architectures on MIST. Transfer experiments reveal that leveraging non-scientific data is of limited benefit for modeling the distinctions in MIST. Our corpus analysis provides evidence that scientific communities differ in their usage of modal verbs; yet, classifiers trained on scientific data generalize to some extent to unseen scientific domains.


Introduction
Each year, an estimated 1.5 million scientific articles are published (Knoth et al., 2020); hence, the construction of knowledge graphs (KGs) from scholarly texts for aggregating and navigating research findings is an active research area (Chandrasekaran et al., 2020; Knoth et al., 2020; Nastase et al., 2019; Demner-Fushman et al., 2019, 2020). Professional academic writing makes ample use of hedges, linguistic devices indicating uncertainty, because scientific propositions are usually considered valid only until they are overwritten by newer findings (Hyland, 1998). Distinguishing valid solutions to problems from unverified and/or potential solutions is a crucial step in information extraction (IE) from scientific text (Heffernan and Teufel, 2018), as KGs should at least mark untested hypotheses as such (see Figure 1). Yet, with the notable exception of BioScope (Szarvas et al., 2008), prior work in this area is limited.
In this paper, we focus on modal verbs, a frequently used device for signaling hedging in academic discourse (Hanania and Akhtar, 1985; Getkham, 2011). Other functions of modals include indicating abilities or restrictions. Their meaning depends on the sociopragmatic context (Yamazaki, 2001), i.e., here on the conventions of the community of a particular academic field. Successful academic writing requires correct community-specific use. As shown in Figure 1, understanding the different notions has relevance to KG population (e.g., Luan et al., 2018; Friedrich et al., 2020). Computational modeling of the functions of modal verbs also has applications in language learning and writing assistance software (Römer, 2004).
Prior work in computational linguistics targeting modal verbs (Ruppenhofer and Rehbein, 2012; Rubinstein et al., 2013; Pyatkin et al., 2021; Marasović et al., 2016) has primarily worked with data from the news domain. The annotation schemes of these datasets largely follow distinctions established in the linguistic literature (see, e.g., Kratzer, 1981; Palmer, 2001; Von Fintel, 2006; Portner, 2009), differentiating between the following coarse-grained modal senses: (a) epistemic expresses judgments about the factual status of a proposition, (b) deontic relates to permission, obligation, and requirements, and (c) dynamic refers to internal abilities or conditions. Our work differs from all of these approaches (a) in that we are the first to address the domain of scientific writing, and (b) in that we do not primarily study modal senses, but instead focus on the pragmatic function of modal verbs, i.e., our aim is to capture an author's reason for using a particular modal verb in a context.
With this paper, we release MIST (Modals In Scientific Text), a manually annotated corpus for investigating the usage of modal verbs in scientific text. Our multi-label annotation scheme for modal functions covers semantic, pragmatic, and rhetorical reasons for an author's use of a modal, with a focus on sub-distinctions that are crucial from an IE viewpoint. MIST consists of 3737 annotated modal verb instances selected from texts of five scientific disciplines (henceforth domains), which is larger than all existing comparable datasets (see Table 1). Our corpus analysis reveals differences in modal use between scientific domains, and between academic and non-academic use. We perform an inter-annotator agreement study and ensure high data quality via adjudication.
Based on MIST, as well as related corpora, we conduct an extensive computational study on automatically classifying functions of modals, comparing CNN-based (Marasović and Frank, 2016) and BERT-based models (similar to Pyatkin et al., 2021). In contrast to prior modeling work, we circumvent modifying the transformer's input by selecting the modal's contextualized output embedding and/or the CLS embedding as input to the classifier. We find that in most cases, a model using both embeddings works best.
To sum up, our paper lays the groundwork for both corpus-linguistic and computational work on modeling functions of modal verbs in scientific text. Our contributions are as follows.
• Our new large-scale dataset annotated with functions of modals in scientific text is publicly available at github.com/boschresearch/mist_emnlp_findings2022.
• We conduct an in-depth corpus study detailing the corpus construction process, agreement, and corpus statistics, as well as a comparison with existing schemes (Sec. 3).
• Our computational experiments provide a systematic comparison of neural models for modal classification on scientific text (Sec. 4 and 5). We find that a combination of the CLS embedding and the embedding of the modal verb itself works best.
• We show that models trained on out-of-genre data do not work well on scientific text, while classifiers trained on annotated scientific text perform well on unseen scientific domains. In sum, these experimental findings underline the value of our new dataset.
Related work

Ruppenhofer and Rehbein (2012, henceforth RR12) annotate modal verbs in MASC (Ide et al., 2008), covering several domains. They also introduce the Modalia version Modalia_M using the 3-way epistemic/deontic/dynamic scheme, mapping conditional and concessive to epistemic and optative to deontic. Finally, their EPOS dataset consists of 7693 sentences for which the same 3-way annotation has been derived via cross-linguistic projection from Europarl (Koehn, 2005) and OpenSubtitles (Tiedemann, 2012). King and Morante (2020) annotate modal verbs in the vaccination debate domain (VCM). Several annotated datasets target modal expressions in a variety of domains, e.g., focusing on could (Moon et al., 2016) in English GigaWord (Parker et al., 2009), or on negotiation dialogues (Lapina and Petukhova, 2017). We are also aware of a cluster of works on annotating and tagging Portuguese data (EP) using multi-genre data and RR12-style annotation schemes (e.g., Mendes et al., 2016; Avila and Mello, 2013; Quaresma et al., 2014b). Cui and Chi (2013) conduct a small annotation study on Chinese modals with Rubin13-style labels (CuiChi13). Yamazaki (2001) performs a corpus study on how American English native speakers interpret modal verbs in the chemistry domain.
Modeling. Early approaches to modal sense classification leverage a lexicon (Baker et al., 2010), or make use of "traditional" features (such as n-grams or part-of-speech tags) combined with maximum entropy classifiers (Ruppenhofer and Rehbein, 2012; Zhou et al., 2015) or SVMs (Quaresma et al., 2014a,b). Li et al. (2019) create context vectors for modals by computing weighted sums of the non-contextualized word embeddings of selected context words. Marasović and Frank (2016, henceforth MF16) generate a sentence embedding using a CNN, hence classifying sentences instead of modal instances. Our models are most similar to those of Pyatkin21, who encode input sentences using RoBERTa (Liu et al., 2019), with the CLS embedding as input for a linear classifier. Their model variants differ in the input: the Context model marks the modal trigger with special tags (Sue <target>can</target> swim); the Trigger+Head model encodes only the trigger and its dependency head without further context.

Further related work. Other related work includes research on speculation in biomedical data (Szarvas et al., 2008; Kim et al., 2011) and on event factuality (e.g., Saurí and Pustejovsky, 2009; Stanovsky et al., 2017; Rudinger et al., 2018; Pouran Ben Veyseh et al., 2019). Bijl de Vroe et al. (2021) integrate a lexicon-based method for modality detection into event extraction; using this tagger, Guillou et al. (2021) find that entailment graph construction does not profit from tagging for modality. Vigus et al. (2019) propose to annotate modal structures as dependencies. Rhetorical analysis of scientific text is often based on Argumentative Zoning (Teufel et al., 1999). Lauscher et al. (2018a,b) provide a dataset and neural methods for extracting and classifying claims from scientific text. Luan et al. (2018), Jiang et al. (2019), and Friedrich et al. (2020) present data-driven work on scientific IE. Heffernan (2021) uses modality as a feature to recognize problem-solving utterances in scientific text.

MIST Corpus
In this section, we describe our new dataset, including its annotation scheme and detailed corpus and inter-annotator agreement statistics. We annotate instances of can, could, may, might, must, and should in research papers from five scientific fields: computational linguistics (CL), materials science (MS), agriculture (AGR), earth science (ES), and computer science (CS). Modal usage is influenced by sociopragmatic context (Yamazaki, 2001) and, as a form of hedging, needs to be understood in its social, cultural, and institutional context (Hyland, 1998), here the global scientific community. Hence, we do not restrict document selection to papers by native speakers of English.

Document and Sentence Selection
We select modal verb occurrences as follows. In our full-text subset of 73 documents, the CL papers are taken from the ACL Anthology, spanning the years 2013-2015. Data from the other domains stems from the OA-STM corpus, with the exception of five open-access documents for MS.
Because some modal-domain combinations are rare, we additionally sample sentences from 348 documents with Creative Commons licenses such that we have at least 100 instances for each modal-domain pair. For CS, we sample papers tagged with cs.CV and published in 2018 from arXiv. Additional MS papers published between 2015 and 2021 were retrieved via PubMed Central. For ES and AGR, we use the DOAJ API to retrieve documents matching the topics of the full-text subset.
For AGR, we add articles from the Journal of Agricultural Science published 2017-2021. In total, we obtain a large-scale dataset of 3737 annotated instances (see Table 2, complete corpus).

Annotation Scheme
Our annotation scheme comprises seven labels for functions of modals in scientific discourse (see Table 3). Table 3 also classifies a set of utterances according to our scheme and those of RR12 and Rubin13. A detailed description of the commonalities and differences is provided in Appendix A. During annotation scheme design, we started out with their categories, but then tailored our scheme to the scientific domain, adding some pragmatic distinctions that are relevant in scientific writing. Annotators have access to the full documents. For labels involving inference, uncertainty, or speculation, annotators are instructed to only refer to the text and not to make use of their own knowledge of whether something is the case.

Annotation Process
Our annotation scheme takes a multi-label approach in which all applicable features may be selected. For each instance of the full-text subset, we collect the annotations of three annotators (two for MS) using the web-based annotation systems Swan (Gühring et al., 2016) and INCEpTION (Klie et al., 2018). We ensure consistency across sub-corpora by means of an adjudication step (for all instances) performed by one author of this paper, who then also labeled the additionally sampled instances. Our total group of annotators consists of one undergraduate as well as three graduate students of CL, one undergraduate student of CS, one graduate student of MS, and one physicist holding a PhD degree. While not all annotators are native speakers of English, they are either domain experts or have a strong linguistic background.

Corpus Analysis
Modal distributions. We first analyze the usage of the different modals per domain. As shown in Table 2, in the full-text subset, the ratio of sentences including modal verbs ranges from 9.0% (AGR) to 14.2% (CS). In Figure 2, we plot the distributions of modals by domain. Except in the case of ES, can is the most frequently used modal by a large margin. In AGR and ES, may is also used frequently. Overall, the distributions of CL, CS, and MS are somewhat similar, while AGR and ES exhibit different modal usage patterns. The distributions differ from modal usage in other genres (for details on MASC and Modalia_M, see Appendix B.2), e.g., the percentage of can is much higher in MIST.
Label distributions. Next, we drill down on the functions of the modals by domain. If an instance has more than one label, each of its labels is counted.
The label distributions differ strongly by modal (see Figure 3 and Table 4), but at times also visibly between domains. Previous corpus-linguistic studies (Takimoto, 2015; Hardjanto, 2016) observe more hedging in humanities and social sciences text compared to the natural sciences. ES, which deals with earth's present features and its past evolution, has notably more inference usages of must and should.
In MS, many cases of could are classified as feasibility, as it is common to report experiments in the past tense in this domain. Also, in MS, may is sometimes used interchangeably with can, as in "stress-strain data may be obtained for ductile materials." The larger amount of rhetorical instances in MS is due to cases such as "We should note that...". Comparing the label distributions of MIST with those of MASC and Modalia_M, we also find notable differences (details in Appendix B.2). For example, may is used mostly in epistemic senses. Our annotations reveal that in AGR and ES, these are mostly speculation; CL and CS texts use this modal to indicate (mostly algorithmic) options. Finally, the use of should seems most community-specific: while it is used predominantly in a deontic way in MASC and Modalia_M, usage in MIST varies by domain. Overall, these observations support the hypothesis that modal usage depends on the sociopragmatic context, and demonstrate the value of genre-specific data such as MIST.
Label co-occurrence. In the full-text subset and in the complete corpus, 24.5% and 22.3% of instances carry more than one label, respectively. Figure 4 shows the total number of label co-occurrences in the adjudicated gold standard. Overall, speculation co-occurs most with other labels, indicating that the author likely had two reasons for using the modal, for example indicating a capability, but marking at the same time that it is unclear whether it actually holds ("The urban ecosystems could account for a significant portion of terrestrial carbon (C) storage (...)."). Often, both a feasibility and a capability reading are possible (see the lower part of Table 3), as in "The above construction can be further simplified.", where simplifiability is an intrinsic property of the construction, but the simplification needs an external actor.
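Co-occurrence counts of the kind underlying this analysis can be computed directly from the multi-label annotations. The following is a minimal sketch; the data layout (one set of labels per modal instance) is an illustrative assumption, not the released corpus format:

```python
from collections import Counter
from itertools import combinations

def label_cooccurrences(instances):
    """Count how often each unordered pair of labels is assigned
    to the same modal instance (multi-label annotations)."""
    counts = Counter()
    for labels in instances:
        # sort so that each pair is counted under a canonical order
        for pair in combinations(sorted(labels), 2):
            counts[pair] += 1
    return counts

# Hypothetical toy annotations:
toy = [{"capability", "speculation"}, {"options"}, {"speculation", "capability"}]
print(label_cooccurrences(toy)[("capability", "speculation")])  # → 2
```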

Inter-Annotator Agreement
Computing agreement for our dataset is not straightforward for two reasons. First, we are dealing with a multi-label scenario, for which standard agreement coefficients cannot easily be applied. Second, for some modal-domain combinations, we only have limited data. Averaging across modal verbs is not meaningful: due to the notably different label distributions, good agreement could merely mean that annotators distinguish modal verbs well (Artstein et al., 2009; Artstein, 2017). Following the idea of Krippendorff's diagnostics (Krippendorff, 1980), we evaluate (on the full-text subset) for each modal-label combination how often annotators agree on whether the label applies or not. For each pair of annotators, we compute κ (Cohen, 1960) for this binary decision for each label, mapping all respective other labels to OTHER. In Figure 5, we report the average of these κ-scores over the pairs of annotators for each valid modal-label combination. For some combinations, high agreement is reached. For infrequent labels or modals, agreement is less satisfying. Many "disagreements" occur in cases where in fact several readings are possible.

Qualitative analysis revealed that some annotators over- or under-used some labels, especially uncertainty, which in the initial round of annotation described here was defined to include both options and speculation. We hence decided to ensure the high quality of our corpus through an adjudication step. In 62.2% of instances, the adjudicator's labels exactly match the majority vote across annotators; in 90.5%, they overlap with the majority vote labels. We further introduced the label options, and two adjudicators re-labeled all instances initially labeled with speculation. Out of these, both labeled 166 instances, reaching F1-agreements of 72.7/81.3/83.5/86.9 for capability, feasibility, options, and speculation, respectively. In the remainder of this paper, we perform experiments based on the adjudicators' labels.
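The per-label agreement computation can be sketched as follows: binarize each annotator's multi-label decisions for one label (the label vs. OTHER), then average pairwise Cohen's κ. This is a self-contained sketch; the data layout is an illustrative assumption, and κ is implemented directly rather than via a library call:

```python
from itertools import combinations
from statistics import mean

def cohens_kappa(y1, y2):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(y1)
    p_o = sum(a == b for a, b in zip(y1, y2)) / n          # observed agreement
    cats = set(y1) | set(y2)
    p_e = sum((y1.count(c) / n) * (y2.count(c) / n) for c in cats)  # chance agreement
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)

def per_label_kappa(annotations, label):
    """Average pairwise kappa for one label: each instance is mapped to 1
    if the annotator assigned the label, and to 0 (OTHER) otherwise.

    annotations: one list per annotator; each entry is the set of labels
    that annotator assigned to the corresponding modal instance."""
    scores = []
    for a, b in combinations(annotations, 2):
        ya = [int(label in inst) for inst in a]
        yb = [int(label in inst) for inst in b]
        scores.append(cohens_kappa(ya, yb))
    return mean(scores)
```

For a full modal-label matrix as in Figure 5, this would be run per modal verb on the instances of that modal.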

Computational Modeling
We now describe our neural models for classifying functions of modal verbs. We assume that targets have been pre-defined, e.g., using a part-of-speech tagger. Our models are based on a pre-trained transformer that provides embeddings for sentences and contextualized token embeddings. We fine-tune SciBERT (SB, Beltagy et al., 2019), which has the same architecture as BERT (Devlin et al., 2019), but has been trained on large volumes of scientific text. On top, we use multiple classification heads, i.e., one per modal, as the label distributions vary substantially by modal. The largest version of our models is trained jointly on multiple datasets and therefore has the aforementioned output heads for each dataset (see Figure 6). The output dimension of these heads varies according to the label-set size of the respective dataset. We test the following model variants.

SBCLS. We feed the CLS embedding of an input sentence into a linear layer with softmax (for single-label classification) or sigmoid (for multi-label classification) activation. This model uses the same decision basis for all modal verbs within a single sentence.

SBmodal. We select the embedding of the word-piece token corresponding to the modal to be classified (modal embedding), and feed this embedding into the linear layer as above. We expect this model to be able to distinguish different modal verbs in the same sentence. The model primarily reflects local context, but to some extent also dependency context (Tenney et al., 2019).

SBCLS,modal. We concatenate the CLS embedding with the modal embedding before feeding it into the linear layer. This model should distinguish modal verbs in the same sentence, while at the same time leveraging the CLS embedding, which is intended to cover the entire sentence.
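The core of SBCLS,modal is the concatenation of two encoder outputs before a per-modal linear head. Below is a minimal numpy sketch of that head only; the encoder (SciBERT) is replaced by given embedding vectors, and all names and dimensions are illustrative rather than the paper's implementation:

```python
import numpy as np

def sb_cls_modal_head(cls_emb, modal_emb, weights, bias):
    """Concatenate the CLS embedding with the contextualized embedding of
    the modal's word-piece token, then apply a linear layer with sigmoid
    activation (multi-label case). In the actual model, SciBERT produces
    the two embeddings and is fine-tuned jointly with one such head per
    modal verb."""
    x = np.concatenate([cls_emb, modal_emb])   # shape: (2 * hidden,)
    logits = weights @ x + bias                # shape: (num_labels,)
    return 1.0 / (1.0 + np.exp(-logits))       # sigmoid label scores

# Toy dimensions: hidden size 4, three labels.
rng = np.random.default_rng(0)
cls_emb, modal_emb = rng.normal(size=4), rng.normal(size=4)
scores = sb_cls_modal_head(cls_emb, modal_emb,
                           weights=np.zeros((3, 8)), bias=np.zeros(3))
print(scores)  # all 0.5 with zero weights
```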

Experiments
In this section, we report our experimental results.

Evaluation Metrics
As majority classifiers are known to provide a strong baseline for modal sense classification (see Rubin13, MF16), we report F1 scores in order to evaluate how well a classifier performs across labels. We compute macro-average F1 (mF1) as the average of the per-label F1 scores for the set of labels with which the modal is labeled at least once in the entire corpus and which are not omitted from the experiments due to extreme sparsity (see Table 4). We also report accuracy; we compute it globally across samples and labels, i.e., we simply count for each label how often the classifier (in)correctly did (not) assign it. For hyperparameter tuning and early stopping, we use the macro-average of weighted F1 scores for each modal-domain combination. These weighted F1 scores are computed by weighting per-label F1 scores by the label's support in the validation set. For computing all metrics, we use TorchMetrics.
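The two main quantities can be sketched as follows, with multi-label instances represented as sets of gold and predicted labels. This mirrors the definitions above rather than the exact TorchMetrics calls:

```python
def per_label_f1(gold, pred, label):
    """F1 for one label over binarized multi-label decisions."""
    tp = sum(label in g and label in p for g, p in zip(gold, pred))
    fp = sum(label not in g and label in p for g, p in zip(gold, pred))
    fn = sum(label in g and label not in p for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn  # F1 = 2*TP / (2*TP + FP + FN)
    return 2 * tp / denom if denom else 0.0

def macro_f1(gold, pred, labels):
    """mF1: unweighted average of per-label F1 scores."""
    return sum(per_label_f1(gold, pred, l) for l in labels) / len(labels)

def global_accuracy(gold, pred, labels):
    """Accuracy computed globally across samples and labels:
    each (instance, label) decision counts once."""
    correct = sum((l in g) == (l in p)
                  for g, p in zip(gold, pred) for l in labels)
    return correct / (len(gold) * len(labels))
```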

Baselines
We report results for the following baselines. Maj always predicts the label most frequent in training. We also re-implement MF16's CNN with 300-dimensional GloVe embeddings (Pennington et al., 2014) and filter region sizes of 3, 4, and 5 with 100 filters each. Replicating MF16's Table 4 (with their hyperparameters and training a separate model for each modal), we find that our CNN implementation is comparable to theirs, with 77.6% accuracy on all verbs compared to MF16's 76.5%. On MIST, we use only one model with per-modal heads. SBCLS-mark is our re-implementation of Pyatkin21's Context model (their most accurate model), but using SciBERT and per-modal heads. We also investigate whether the genre-specific pre-training is beneficial, replacing SciBERT with BERT (BERTCLS,modal), and how the model size affects performance, comparing to BERT-largeCLS,modal (to date, there is no SciBERT-large).

Experimental Settings
We randomly split MIST into a training and a test set of complete documents, aiming to cover approximately 25% of each domain's modal instances in the test set, with actual test set sizes ranging from 22.8% to 27.0%. In our CV training setting, we split the training set into 5 folds of complete documents, and train 5 models on 4 folds each, using the respective fifth fold for model selection. We train for at most 100 epochs, performing early stopping with a patience of 10 epochs. We then run each of these five models on the unseen test set, reporting average scores along with standard deviations. Hyperparameters are reported in Appendix D.1.
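Document-level splitting, so that no document contributes instances to more than one fold, can be sketched as follows. The instance representation is an illustrative assumption, and this sketch balances the number of documents per fold rather than per-domain instance counts as in our actual split:

```python
import random
from collections import defaultdict

def document_folds(instances, k=5, seed=0):
    """Partition (doc_id, instance) pairs into k folds of complete
    documents, so that all instances of a document land in one fold."""
    docs = sorted({doc for doc, _ in instances})
    random.Random(seed).shuffle(docs)
    # assign shuffled documents round-robin to folds
    fold_of = {doc: i % k for i, doc in enumerate(docs)}
    folds = defaultdict(list)
    for doc, inst in instances:
        folds[fold_of[doc]].append((doc, inst))
    return [folds[i] for i in range(k)]
```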

Experimental Results on MIST
Here, we evaluate the neural architectures described above on MIST, and investigate performance in the absence of in-domain training data.
Comparing Model Architectures. We first compare the model architectures on MIST. The magnitude of the mF1 scores differs by modal verb. The CNN learns more than Maj, but is always outperformed by the SciBERT-based models. SBmodal is better than SBCLS on can, might, and should, but worse on must, where using an additional sentence-wide embedding is beneficial. For most of the verbs, SBCLS,modal is the best SciBERT-based model, but SBmodal is better on may and might.
In general, SBCLS,modal tends to have smaller standard deviations across CV training configurations than the other SciBERT-based models. On could and must, SBCLS,modal is better than SBCLS-mark, suggesting that directly using the modal's embedding instead of modifying the input is more effective.
On most verbs, SciBERT and BERT perform comparably, but the domain specificity of SciBERT leads to clear improvements on can and must. Interestingly, increasing the model size for BERT is beneficial on the very same verbs; at the same time, however, it hurts performance on the other verbs, with an especially sharp loss on could.
With the exception of can, SBCLS,modal is also the most accurate model (scores in Appendix D.3). For this model, during development, we experimented with using only one classifier head for all modals (not reported in tables). Compared to per-modal heads, we observed either no difference or slightly worse performance (by around 1 point mF1 on average) for all modals except must, where mF1 increased by around 15 points. These gains were due to similar rhetorical instances, e.g., "We must note that..." and "We should note that...".

Cross-Domain Results on MIST. We conduct a cross-domain experiment on MIST to determine the extent to which in-domain training data is necessary for classifying modal verbs in different scientific communities. Since some modal-domain combinations have rather little data, in this experiment, we split MIST into six folds and use each fold once for testing. We use four of the remaining five folds for training and one for early stopping. Table 6 reports the cross-validated averages and standard deviations of averages of per-modal mF1 and accuracy to show the overall effect of in-domain data. Models trained on other (scientific) domains work well on unseen domains, as the performance does not decrease substantially when training without in-domain data. As one would expect, domain-specific data usually leads to improvements, especially for domains in which a specific modal has a visibly different label distribution (see Figure 3); e.g., cross-validated mF1 for could on CL increased by around 18 points. For other modal-domain combinations, gains were less distinct or sometimes non-existent, and cross-validated scores had a high variance. On average, the standard deviation of accuracy was 2.5 and 2.7 for training with and without in-domain data, respectively. For mF1, the standard deviation was 10.7 when training with in-domain data and 11.3 when training without in-domain data.
In sum, we expect classifiers trained on MIST to also generalize to new scientific domains to some extent.For optimal performance, adding in-domain data is beneficial in most cases.

Transfer from GME to MIST
In this experiment, we show that functions of modal verbs in scientific text cannot be determined simply by using existing datasets. We train a model only on an out-of-genre resource (a version of GME, henceforth GME_T) and evaluate on MIST. We train and evaluate all models in this experiment using the mapped annotation scheme, using the SBCLS,modal architecture with sigmoid heads. For hyperparameter tuning and evaluation, we perform the steps as described in Appendix D on GME_T with five randomly induced folds (SBCLS,modal; GME_T). SBCLS,modal; MIST is SBCLS,modal trained on MIST with mapped labels. SBCLS,modal; MIST-small is trained on a randomly downsampled version of MIST to account for the notably larger size of MIST compared to GME_T. We approximately proportionally downsample each MIST fold at random (with mapped labels) to obtain MIST-small, which has exactly the same number of instances as GME_T.
Table 9 shows the results of our transfer experiment. SBCLS,modal; GME_T learns more than just the majority baseline MajGME_T (except for might, for which GME contains little data), but clearly lags behind the models trained on MIST in both mF1 and accuracy, with average mF1 being between 23.9 and 37.1 points lower than that of the models trained on MIST-small. (We thank the anonymous reviewers for proposing this experiment.) A related experiment (reported in Appendix D.4) using prior corpora of annotated modals in multi-task objectives confirmed the limited transferability. Hence, genre-specific data is clearly required for classifying functions of modal verbs in scientific discourse, demonstrating the value of MIST.

Conclusion and Outlook
In this paper, we have introduced a new large-scale dataset of scientific text annotated for functions of modal verbs. Our corpus and computational studies reveal differences and similarities in modal usage across genres and domains. We have shown that neural classification is robust across scientific domains, but also that annotated scientific text is essential for good performance. To sum up, our paper lays the groundwork for informed IE from sentences containing modals in scientific texts, e.g., distinguishing speculations from capabilities attributed to a method or device.

Future work. Our experiments on MIST point to various next steps, e.g., identifying domain-adaptation methods that more effectively leverage annotations across domains, genres, or even languages, or developing data-augmentation techniques targeted to scientific text. Another next step is to integrate our methods for generating metadata for facts into IE systems. Existing open IE systems do not handle the meaning of modal verbs adequately. As an outlook, in Appendix E, we outline how this could be improved using a classifier trained on MIST.

Acknowledgments

We thank Daria Stepanova for a very helpful discussion on knowledge representation. We thank our annotators Sherry Tan, Prisca Piccirilli, Johannes Hingerl, Federico Tomazic, and Anika Maruscyk for their dedication to the project and insightful discussions. We also thank Josef Ruppenhofer for answering our questions on the Modalia annotation scheme, and Valentina Pyatkin and Shoval Sadde for answering our clarification questions on their models and experiments.

Limitations
Closed class of targets. Our work is limited to a closed class of linguistic expressions (modal verbs). Such approaches are sometimes seen as "too narrow" to be of interest to the NLP community. However, we argue that examining components of language understanding in detail will ultimately point to relevant research directions. In addition, as we have shown, modal verbs are a very common phenomenon, occurring in about every tenth sentence of scientific text. Nevertheless, we admit that a limitation of our study is the focus on a closed set of verbs in the English language. Future work might generalize our ideas to a more open class of targets (yet, it is a challenge to come up with a well-defined selection).
Limited data for minority classes. For some categories, data is limited due to the difficulty of data collection (we can only sample for modal verbs, not for labels). We have already enriched the dataset through a second annotation round; further data collection is unfortunately infeasible in the context of our project.
Applications. Our study provides the first steps (an annotated dataset, a corpus-linguistic study, and NLP models) of research into the computational modeling of modal verbs in scientific text. Our distinctions intuitively should be of high relevance to processing and mining scientific text. Appendix E contains a case study on why the distinctions matter for open information extraction (IE), along with practical suggestions for incorporating them into existing Open IE systems. Demonstrating the usefulness of our work on existing scientific relation extraction datasets (which unfortunately do not commonly mark the "information status" of the annotated relations) is beyond the scope of this paper, but planned future work.

A Comparison to Existing Annotation Schemes for Modal Senses
Table 3 classifies a set of utterances according to our, RR12's, and Rubin13's schemes (according to our interpretation of their guidelines). These two works inspired ours, but with the aim of knowledge graph construction in mind, we tailored an annotation scheme making explicit the various pragmatic and rhetorical reasons for using modals in scientific writing. We thereby follow Moon et al. (2016), who argue that "not everything about modal auxiliary meaning can be represented at once," and that "it is important to focus on the parts of modal auxiliary meaning that most directly impact an automated learner." While we fully agree with the linguistic classification of the examples by RR12 and Rubin13, we found certain sub-distinctions to be essential for understanding modal usage in the scientific context, and designed our annotation scheme for functions of modals accordingly, intentionally conflating what is traditionally treated separately as modal sense disambiguation and veridicity (Karttunen and Zaenen, 2005) from the author's point of view.
The definition of Rubin13's label Circumstantial, focusing less on dispositions and more on abilities in particular circumstances (Von Fintel, 2006), is closer to our feasibility, which could be interpreted as an ability of the actor given the circumstances (but sometimes overlaps with internal properties of the object under discussion). Conversely, we do not distinguish personal wishes and goals as in Rubin13. RR12's label deontic for can falls under our label options if options are introduced, and maps to our deontic otherwise. Within the epistemic notion, we further distinguish whether a statement is derived from other facts (inference) or whether an author speculates (both labels may apply at the same time). As some usages of modal verbs in scientific writing are rather conventional, we introduce the label rhetorical.

B Further Corpus Statistics

B.1 Impact of Negation
Analyzing all negated modal verb constructions, we found only two instances where the negation affects the modality label. For example, "Submarine volcanism alone cannot be the sole driving mechanism for OAEs" would be labeled capability if the negation were ignored; with the negation, it becomes an inference.

B.2 Comparison of Label Distributions of MIST, MASC, and Modalia M
The distribution of modal functions and senses differs between corpora and genres (academic writing vs. news). Comparing Figure 3 and Figure 7, we note several differences. The most frequent modal in all genres is can, but it is much more frequent in CL, CS, and MS. For can and could, dynamic/feasibility/capability uses are predominant, with the exception of Modalia M, where the majority class of could is epistemic. Can and could are not used in the deontic sense in MIST; their epistemic uses are all related to speculation.

C Annotation Guidelines
In this section, we describe our annotation guidelines for marking up modal verbs in scientific publications with regard to whether they are used for particular rhetorical, semantic, or pragmatic reasons, as they were presented to the annotators. Depending on the context, modal verbs can modify a sentence's propositional content such that uncertainty about the truth of the proposition is implied (e.g., "X is the cause for Y" vs. "X may be the cause for Y"), but in other circumstances, they simply indicate properties or capabilities (e.g., "X can float").
Our goal is to provide information about the functions of modal verbs in our corpus that can then be used in a preprocessing step for information extraction. For example, when disregarding a modal's contribution to the discourse while processing "X can float", the relation float(X) may be extracted, but it should be flagged somehow, as the sentence states neither that X is currently floating nor that it always floats. In contrast, adding has_capability(X, float) to our knowledge base is desirable. We consider can, could, must, should, may, might as well as their negated forms for annotation. They are pre-marked in the corpus to ensure that no modal verb is overlooked. Our annotation scheme is based on the observation that it is not always possible to assign exactly one type to every instance. We decided to follow a feature-based annotation approach in which a modal verb is represented by features that do or do not apply. Our feature set reflects the range of functions a modal verb can fulfill.

speculation: speculation is used when the truth value of an utterance is not clear according to the author. Note that we annotate this feature only in cases where feasibility or capability are not clearly the predominant readings, and use both features only if a speculation reading is really predominant.

Example 6. This problem might be mitigated by using better semantic-based retrieval model.
Here, we label both feasibility and speculation.
Consider replacing might with can: then, the feasibility is clear, but no speculation is involved, which is the author's reason for choosing might instead.

options: options is marked up when the author uses the modal verb to enumerate potential options.
Example 7. The real shielding can of course be different.
A different shielding may be used; potentially the shielding is different, but it can also stay the same. Note that "being different" is not a property, hence capability or feasibility would not fit here.
Example 8. Breakfast, pancakes and hashbrowns are options for w1, w2 and w4.
Example 9. This process can last from several hours to a few days depending on the applied temperature.
The reason for using the modal verb here is mostly to convey uncertainty about the duration; it does not describe a capability of the process.
Example 10. We showed that combining a model based on minimal units with phrase-based decoding can improve both search accuracy and translation quality.
In this case, we label both capability and options, as the sentence indicates a capability of the combination method, but at the same time could be read as a hedging term (i.e., improvements occur only in certain circumstances).

deontic: deontic is selected if the author uses the modal verb to express a desire (i.e., how the world should be), a requirement for something (e.g., an experiment), or an obligation.
Example 11. A value is defined as (first part of the definition) and also as (second part of the definition). (In this example, speculation and feasibility are also applicable.)

Other: This label is used if none of the above features apply. Please extract those sentences and explain why you could not decide on a predefined feature. Also consider whether you have a tendency towards one or more features but there is something that would have to be captured in our scheme in addition. This label was used during annotation scheme development.
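The feature-based scheme described above amounts to multi-label annotation: each modal instance carries a set of independent binary features rather than exactly one class. A minimal sketch in Python (the class and helper names are our own illustration, not part of any released tooling):

```python
# Minimal sketch of the feature-based (multi-label) annotation scheme:
# each modal instance carries a set of independent binary features.
# Label inventory follows the guidelines; the class itself is illustrative.
from dataclasses import dataclass, field

LABELS = {"speculation", "options", "feasibility", "capability",
          "deontic", "inference", "rhetorical"}

@dataclass
class ModalInstance:
    sentence: str
    modal: str
    features: set = field(default_factory=set)

    def add(self, *labels):
        unknown = set(labels) - LABELS
        if unknown:
            raise ValueError(f"unknown label(s): {unknown}")
        self.features.update(labels)

# Example 6 from the guidelines: both feasibility and speculation apply.
ex6 = ModalInstance(
    "This problem might be mitigated by using better semantic-based retrieval model.",
    "might")
ex6.add("feasibility", "speculation")
print(sorted(ex6.features))  # ['feasibility', 'speculation']
```

Representing features as a set directly reflects that labels are not mutually exclusive and that an instance may carry two of them at once.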

C.2 Additional examples: feasibility vs. capability
As stated above, features are not mutually exclusive. Sometimes, multiple readings/interpretations may be possible. Under certain circumstances, annotators are asked to select multiple features. In this section, we show some not-so-clear-cut examples to complement the above guidelines, which work with mostly clear examples.
If we want to annotate feasibility or capability but it is hard to decide which of the two features applies, we follow these guidelines.
We annotate both features when an external actor (e.g., a human) is involved, but the sentence can also be interpreted as describing a particular internal property of the referent of the subject. The referent has this property even before any external actor gets involved.
Example 16. This simpler distribution Q can be viewed as an approximation to P.
feasibility: A human agent views Q as an approximation to P.
capability: Even without a human agent viewing it, Q is still an approximation to P; Q has the property of being an approximation to P in general, without somebody actually viewing it.
Example 17. For instance, despite graphene, the band gaps of silicone can be opened and tuned when exposed to an external electric field.
feasibility: A human agent opens the band gaps of silicone.
capability: Some materials have the property of having openable band gaps; it is always possible to open the band gaps of silicone under these circumstances.
Whenever there is a human actor involved, we mark up feasibility, even if the sentence includes a passive construction that could indicate a capability. capability and feasibility are only used at the same time if the modal verb signals an intrinsic property (band gaps of silicone can be opened) and an external actor is involved. We do not mark up capability if feasibility applies but there is no general property; in this case, an external actor has to do something first, and only as a consequence does some entity have a capability.
Example 18. The resulting expression combines similarity terms which can be divided into two groups.
feasibility: A human actor is needed to divide the terms into groups. Being dividable is not an intrinsic, common property of these terms. feasibility is the strongest modal function in this utterance. The modal verb is not used to convey information about a capability, as it is not an intrinsic property of similarity terms that they can be divided (we consider this to be an artifact of their being grouped).
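The feasibility-vs.-capability decision procedure above boils down to two questions: is an external actor involved, and does the referent have the relevant intrinsic property independently of any actor? The following sketch encodes this (the function and its boolean arguments are hypothetical names we introduce for illustration):

```python
# Illustrative sketch of the feasibility-vs.-capability decision rule.
# The two boolean inputs correspond to the questions annotators ask;
# function and argument names are ours, not part of the guidelines.
def feasibility_capability(external_actor: bool, intrinsic_property: bool) -> set:
    labels = set()
    if external_actor:
        # A human/external actor is involved: feasibility applies,
        # even for passive constructions.
        labels.add("feasibility")
    if intrinsic_property:
        # The referent has the property independently of any actor.
        labels.add("capability")
    return labels

# Example 17: band gaps can be opened (actor involved + intrinsic property)
print(sorted(feasibility_capability(True, True)))   # ['capability', 'feasibility']
# Example 18: terms can be divided into groups (actor, no intrinsic property)
print(sorted(feasibility_capability(True, False)))  # ['feasibility']
```

Both labels fire only in the "intrinsic property plus external actor" case, matching the rule stated above.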
When it is clearly possible for an entity to have a property but this does not apply in general, we still use capability, but possibly with speculation in addition.
Example 21. graphene aerogels with ... can present superelasticity.
capability: Some of these aerogels have this property; it is possible for aerogels to have this property.
speculation: It is uncertain whether each aerogel has this property; only some of them may present superelasticity, or aerogels may have this property only under particular circumstances.

D Experimental Studies

D.1 Hyperparameters
This section describes the hyperparameter tuning for our main experiments. For CNN and SB, we tune learning rates, batch sizes, dropout probabilities (only CNN), and learning rate warm-up lengths (only SB) using grid search over the values shown in Table 10: similar to cross-validation (CV), for each hyperparameter configuration, we train five models on four folds each for 10 epochs and use the respective remaining fold (validation fold) for model selection. For each of the five models, we average weighted F1 scores (see Sec. 5.1) on the validation fold across modal verbs. We then choose the hyperparameter setting that performs best on average across the different models. The tuned batch sizes and learning rates are 32 and 5e-3 (CNN), and 8 and 3e-5 (SB). SB is warmed up for 2 epochs. We use a dropout probability of 0.1 in the output heads, and the Adam optimizer (Kingma and Ba, 2014) with a weight decay of 1e-3 (CNN) and 0 (SB), respectively.
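The selection procedure can be sketched as follows; the grid values below are placeholders rather than the full grid from Table 10, and the scoring callback stands in for actual model training:

```python
# Sketch of the hyperparameter selection procedure described above:
# grid search with a 5-fold rotation, averaging validation scores across
# the five models per configuration. Grid values are illustrative only.
from itertools import product
from statistics import mean

GRID = {"lr": [5e-3, 3e-5], "batch_size": [8, 32]}

def select_hyperparameters(train_and_score):
    """train_and_score(config, val_fold) -> weighted F1 on that fold."""
    best_config, best_score = None, float("-inf")
    for values in product(*GRID.values()):
        config = dict(zip(GRID.keys(), values))
        # Train five models, each holding out one fold for validation,
        # and average their validation scores.
        score = mean(train_and_score(config, fold) for fold in range(5))
        if score > best_score:
            best_config, best_score = config, score
    return best_config

# Dummy scorer: pretend smaller learning rates validate better.
best = select_hyperparameters(lambda cfg, fold: -cfg["lr"])
print(best)  # {'lr': 3e-05, 'batch_size': 8}
```

The key point is that one score per configuration is obtained by averaging over the five fold-rotated models before comparing configurations.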

D.2 Training Details, Model Size, etc.
All experiments were performed on a single Nvidia Tesla V100 GPU. Training and testing the SBCLS,modal models in the 5-fold CV training setting used in the model architecture comparison experiment (cf. Table 5) took 1.2 hours (for the entire experiment).
SciBERT has the same number of parameters as BERT-base, i.e., 110M. The linear layer we add on top of SciBERT in SBCLS,modal has fewer than 11k parameters.

D.3 Further Experiment Results
This section provides further experimental results, elaborating on the study described in Sec. 5.4.
Table 11 provides accuracy scores for the models whose F1 scores are reported in Table 5.

D.4 Cross-Genre Multi-Tasking Experiment
We investigate whether we can improve classification on MIST by using existing modal sense classification datasets as auxiliary tasks during training. The only verb for which co-training leads to clear improvements is may. Here, it increases per-label F1 scores (not reported in the tables) for spec., opt., feas. (for the latter two, except with GME), and cap. (except with Modalia M). For the other verbs, classification performance is similar (e.g., should) or decreased (e.g., might, which is only covered by GME, and only with few instances). Thus, in line with the findings from the pure transfer experiment, using modal sense information from out-of-genre datasets for classifying modal verbs in scientific text is non-trivial.

E Case Study: Treatment of Modality in Open Information Extraction
We now discuss how the handling of modal verbs in Open Information Extraction (OIE) systems may be improved using our classification scheme, by adding interpretations instead of just pinpointing modal verbs. The same principles can be applied to relation extraction settings with predefined schemas whenever relations are rooted in a verbal argument structure.

E.1 Analysis of Existing OIE Systems
We run four popular recent OIE systems on sentences from MIST and perform a qualitative analysis of the results. We find that the examined systems either have no specific mechanism for handling modality, or include modality information only in rather rudimentary ways. OpenIE4 (Christensen et al., 2011; Pal and Mausam, 2016) and OpenIE6 (Kolluru et al., 2020) simply treat modal verbs as part of the extracted predicate. In sum, modals are extracted by all of these OIE systems, but their classification and interpretation is left to the downstream system. MinIE (Gashteovski et al., 2017) includes a notion of modality by adding a binary modality value (certainty/possibility) to each extracted triple. In practice, we observe that the occurrence of virtually any modal in the input sentence results in the triple being assigned the possibility label. This means that sentences such as "X can influence Y," "X should influence Y," "X must influence Y," or "X may influence Y" are in effect all treated as paraphrases. In sum, existing state-of-the-art OIE systems do not handle the meaning of modal verbs in a way that could inform downstream use.

E.2 Discussion: Modality-informed Open IE
In light of the weaknesses of existing systems, we now sketch an approach by which OIE systems could be extended to incorporate modality information, which could be generated by a classifier (as described in Sec. 4). As motivated by Figure 1, we posit that there are two main ways in which modality information should be incorporated into extractions. (For an overview, see also Table 13.) First, we propose specific relation templates for the capability and deontic modalities: hasCapabilityTo_* for the former and isRequiredTo_* and isAllowedTo_* for the latter. In a given extracted triple, these relation templates would be instantiated with the main verb of the extraction, e.g., "X can influence Y" (capability) would yield (X, hasCapabilityTo_influence, Y). In an OWL-like ontology, these concretely instantiated predicates may then be considered subproperties of generic hasCapabilityTo / isRequiredTo / isAllowedTo properties. Second, to cover cases modifying not only the relation but the entire fact, we propose the meta-property hasFactualityRating (see also Figure 1). This property could take the values speculation (for speculation), possible (for options and feasibility), inferred (for inference), and true (for rhetorical and as the default value of the property). For example, the sentence "X might influence Y" (speculation) would yield (X, influence, Y) with hasFactualityRating(speculation), whereas "These sandstones may contain reworked material." (options) would lead to (sandstones, contain, reworked_material) with hasFactualityRating(possible). Similar approaches to handling the veridicality of utterances have been proposed, for instance, by de Marneffe et al. (2012). (Systems referenced above: RnnOIE demo at demo.allennlp.org/open-information-extraction; MinIE at github.com/uma-pi1/minie.)
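A minimal sketch of the proposed mapping (cf. Table 13); the function name and the tuple-based triple representation are our own illustration, and the deontic case is simplified to the requirement reading:

```python
# Sketch of the proposed mapping from modal functions to OIE extractions.
# Function name and triple representation are illustrative; a real system
# would also distinguish the permission reading (isAllowedTo_*) for deontic.
FACTUALITY = {"speculation": "speculation", "options": "possible",
              "feasibility": "possible", "inference": "inferred",
              "rhetorical": "true"}

def modality_informed_triple(subj, verb, obj, function):
    """Return (triple, factuality rating) for one extraction."""
    if function == "capability":
        # Relation template instantiated with the main verb.
        return (subj, f"hasCapabilityTo_{verb}", obj), "true"
    if function == "deontic":
        # Simplified: requirement reading only.
        return (subj, f"isRequiredTo_{verb}", obj), "true"
    # Otherwise the triple keeps the plain verb and the fact-level
    # hasFactualityRating meta-property carries the modal information.
    return (subj, verb, obj), FACTUALITY.get(function, "true")

print(modality_informed_triple("X", "influence", "Y", "capability"))
# (('X', 'hasCapabilityTo_influence', 'Y'), 'true')
print(modality_informed_triple("X", "influence", "Y", "speculation"))
# (('X', 'influence', 'Y'), 'speculation')
```

This separates the two mechanisms described above: modal functions that reshape the relation itself versus those that only annotate the factuality of the whole triple.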
We argue that such an approach would constitute an improvement over existing ways of handling modality in OIE. Enabling identification across surface representations is one aim of OIE systems. Looking further ahead, explicitly disambiguating modal verbs as well as other constructions expressing the same meaning will result in a uniform representation. For example, "X can influence Y" (capability) and "X is able to influence Y" would both be retrieved by searching for hasCapabilityTo, and "X must Y" (deontic) and "X has to Y" would be retrieved when searching for isRequiredTo. In addition, the hasFactualityRating properties of extracted triples will immediately clarify their factuality status, avoiding, e.g., erroneously taking speculation as fact. Taken together, we have outlined a way to take OIE systems to the next level with regard to the treatment of modal verbs.

Figure 1: Modal verbs perform various functions in scientific text, affecting KG representations.

Figure 2: MIST: Distribution of modals by domain, computed over the full-text annotation subset.

Figure 3: MIST: Label distributions by modal verb and scientific domain (adjudicated complete corpus).

Table 1: Datasets manually annotated with modal verb categories.*

Table 3: MIST annotation scheme in comparison to those of RR12 and Rubin13/Pyatkin21.

Example | MIST | RR12 | Rubin13 / Pyatkin21
Several supercapacitors can be integrated and connected in series. | feasibility | dynamic | Circum. / State of the World
The device can light up a red light-emitting diode and works well. | capability | dynamic | Ability / State of the Agent
The overlap in the ranges [...] indicates that the sample must be older than 50.70 Ma. | inference | epistemic | Epistemic / State of Knowledge
The real shielding can of course be different. | options | deontic | Circum. / State of the World
DA3 may therefore indicate a continuation of high nutrient surface water with an elevated freshwater input. | speculation | epistemic | Epistemic / State of Knowledge
Energy storage devices should be able to endure high-level strains. | deontic | deontic | Bouletic / Desires and Wishes
A GCR proton [...] must have at least 150 MeV to reach the station. | deontic | deontic | Teleological / Plans and Goals
You must leave the lab tidy. | deontic | deontic | Deontic / Rules and Norms
It can be seen in Figure 1 that... | rhetorical | dynamic | Ability / State of the Agent
For instance, despite graphene, the band gaps of silicone can be opened and tuned when exposed to an external electric field. | feas., cap. | dynamic | Circum. / State of the World
These results suggest that epeiric seas [...] may have played an important role in the driving mechanism for OAE 2. | inf., spec. | epistemic | Epistemic / State of Knowledge
Long may she live! | deontic | optative | Bouletic / Desires and Wishes

Table 5: Macro F1 (mF1) of the various neural models on the test set of MIST. #inst. train refers to the entire training set.
feasibility, options → State of the World; capability, rhetorical → State of the Agent; speculation, inference → State of Knowledge

Table 8: Transfer experiment: Macro F1 on mapped test set of MIST.
published by Pyatkin21). We train on GME T, i.e., all instances from GME (including the test set) that cover MIST's set of modal verbs, using mapped labels as shown in Table 7. Resolving GME's State of Knowledge into inference and speculation, and State of the World into feasibility and options, would require a manual re-annotation. We map deontic to Pyatkin21's supertype Priority.

Table 9: Transfer experiment on mapped test set of MIST.
Our daily life requires matchable energy storage devices, which should have the capability to endure high-level strains. It is desirable that energy storage devices have the capability to endure high-level strains.

Example 12. A GCR proton at the maximum latitude of the ISS must have at least about 150 MeV to reach the Station.

In some contexts, modal verbs are used because of conventions, and there is no substantial semantic need for doing so. We annotate these cases with rhetorical.

Example 14. It can be seen in Figure 1 that...

This can simply be restated as "In Figure 1 you see ...". If annotators feel that feasibility or capability are also strongly present in such a case, they may select these features in addition.

Example 15. Value: <first part of the definition> The value can also be described via <second part of the definition>.
Example 19. Similar symmetry can be achieved

Table 10: Hyperparameter values searched during hyperparameter selection for CNN and SB.

Word vectors can be trained directly on a new corpus.
feasibility: It is possible for somebody to train some word vectors on a new corpus. Word vectors cannot be trained directly on a new corpus in general; not all word vectors are trainable on a new corpus, so we do not annotate capability.

Table 11: Accuracy on test set of MIST. Standard deviations are rather small, between 0 and 1.4.
Table 12 shows the results of co-training with Modalia M, MASC, EPOS, and GME (see Sec. 2), as well as with the first three at once. On GME, we follow Pyatkin21's experiments and collapse Desires+Wishes and Plans+Goals into a single Intentional class.

Table 12: Multi-task setup: Macro F1 on test set of MIST when co-training with other corpora. E/M/Mo: EPOS, MASC, and Modalia M together.
OpenIE4 and OpenIE6 represent sentences in the form of standard subject-relation-object triples, simply considering modals part of the predicate; e.g., a sentence such as "X may influence Y" yields the extraction (X; may influence; Y). RnnOIE (Stanovsky et al., 2018) generates a representation resembling Semantic Role Labeling (SRL), in which spans within the sentence are annotated to indicate the relation-evoking verb and its respective arguments, e.g., [ARG0: X] [ARGM-MOD: may] [V: influence] [ARG1: Y]. Within this representation, modal verbs are treated as a simple modifier of the relation verb (ARGM-MOD).
Modal function | IE extraction(s)

Table 13: Mapping modal functions to Open IE extractions; * = modified main verb.