Transforming Term Extraction: Transformer-Based Approaches to Multilingual Term Extraction Across Domains

Automated Term Extraction (ATE), though well investigated, remains a challenging task. Conventional approaches extract terms at corpus or document level, and the benefits of neural models remain underexplored, with very few exceptions. We introduce three transformer-based term extraction models operating on sentence level: a language model for token classification, one for sequence classification, and an innovative use of Neural Machine Translation (NMT), which learns to reduce sentences to terms. All three models are trained and tested on the dataset of the ATE challenge TermEval 2020 in English, French, and Dutch across four specialized domains. The two best performing approaches are also evaluated on the ACL RD-TEC 2.0 dataset. Our models outperform previous baselines, one of which is BERT-based, by a substantial margin, with the token-classifier language model performing best.


Introduction
Automated Term Extraction (ATE) aims at extracting terms, i.e., single- or multi-word sequences, from domain-specific text. ATE plays a role in many NLP tasks, such as information extraction, knowledge graph learning, and text summarization. In a corpus-level setting, methods range from frequency-based to utilizing Wikipedia links, where no single method has been found to perform consistently best across domains in English (Astrakhantsev, 2018). In document-level ATE, KeyConceptRelatedness (Astrakhantsev, 2014), which relies on keyphrase extraction and semantic relatedness, outperforms other methods (Šajatović et al., 2019). The use of neural networks in these methods is mostly limited to generating embeddings.
A first use of BERT-based language models is documented by Hazem et al. (2020), the winning system of the recent ATE challenge TermEval 2020 (Rigouts Terryn et al., 2020) and the baseline for the proposed approaches. Inspired by this first success of transformer-based models, we compare two variations of the multilingual pretrained language model XLM-RoBERTa (XLM-R) (Conneau et al., 2020) with an innovative use of the multilingual pretrained NMT model mBART on the Annotated Corpora for Term Extraction Research (ACTER) dataset (Rigouts Terryn et al., 2019) utilized in TermEval 2020, as well as on the ACL RD-TEC 2.0 dataset (QasemiZadeh and Schumann, 2016). Since masked language and NMT models take sentences as input, the proposed ATE methods operate on sentence level. In spite of this reduced context of sentence input rather than documents or corpora, the models achieve F1 scores of up to 69.8% on ACTER, strongly outperforming the previous baseline of 48.1%.
An XLM-R-based sequence classifier relies on positive (term) and negative (non-term) samples, which are generated from all n-grams up to a length of six of a given sentence. A second XLM-R-based token classifier decides for each word in a sequence whether it can be considered (part of) a term. Since the second model operates without upfront n-gram generation and only processes each sentence once, it is considerably more time-efficient than the first. Finally, the pretrained NMT model mBART is adapted to transform input sentences into sequences of comma-separated terms, an approach inspired by NMT-based ontology learning (Petrucci et al., 2018).
Analyses of the results reveal interesting insights into the performance of the different input processing strategies and transformer-based models, including their ability to handle multi-word terms, the training time required, and a comparison between baseline monolingual and multilingual language models in ATE. To achieve sentence-level ATE, the ACTER dataset had to be preprocessed by aligning terms with their occurrences in sentences; we make this preprocessed version publicly available together with our source code. In summary, our main contributions are: (i) we show that transformer-based models can be successfully applied to ATE across three languages and five domains, without the need for text preprocessing or feature extraction; (ii) we show that ATE can be performed successfully on sentence level; (iii) we conduct robust experiments to show that our models outperform competitive baselines; (iv) we investigate the models' abilities to handle single- and multi-word terms, distinct term types, and differences in performance depending on train and test language combinations.
Related Work

Existing classifications of ATE methods cannot easily accommodate recent neural approaches, which generally operate on sentence level. The approach most closely related to ours, and our baseline, by Hazem et al. (2020), utilized RoBERTa (Liu et al., 2019) for English and CamemBERT (Martin et al., 2020) for French and won the TermEval 2020 challenge. In their work, pretrained language models clearly outperformed a classification method based on a variety of features, such as statistical descriptors and the domain-specificity measure termhood (Kageura and Umino, 1996). A recently published approach (Rokas et al., 2020) relies on LSTM, GRU, and BERT embeddings and achieves high F1 scores for ATE of Lithuanian terms in the cybersecurity domain. Several approaches build on word embeddings to perform ATE on specific domains, such as medicine (e.g. Bay et al., 2020), or to separate general-language from domain-specific embeddings (Hätty et al., 2020). In contrast, our models perform ATE on four domains and in three languages, utilizing a pretrained language model and a pretrained NMT model. Extracting terms is also vital to learning expressive ontologies from text, for which Petrucci et al. (2018) train an NMT model to transform sentences into Description Logic formulas, an idea that inspired our NMT-based ATE model.

Language Models and NMT
Neural language models, which create contextualized language representations, are responsible for many of the recent improvements in NLP. Such models acquire rich contextualized representations in a pretraining stage, in which they learn to predict a masked word in a sentence, a task for which large amounts of training data are readily available. The representations learned in this way can be reused for various downstream tasks in the so-called fine-tuning stage, where task-specific layers are added on top of the pretrained language model. One of the most popular language models is BERT (Devlin et al., 2019), which utilizes the transformer architecture (Vaswani et al., 2017). XLM-R (Conneau et al., 2020) is a multilingual variant of BERT, pretrained in 100 languages on 2.5 terabytes of Common Crawl data. Moreover, it makes use of the improved training routine introduced by RoBERTa (Liu et al., 2019).
Despite the widespread use of neural language models in NLP, the adoption of such self-supervised pretraining approaches in NMT has only recently started to gain traction. NMT is traditionally performed with sequence-to-sequence encoder-decoder models that generate a target-language output sequence based on a source-language input sequence. Conventional language models trained on predicting masked words in a sequence, such as BERT, have only recently been incorporated into NMT (Zhu et al., 2020). An interesting alternative is to pretrain an NMT transformer architecture itself, as done with the Bidirectional and Autoregressive Transformer (BART). This is achieved by combining a bidirectional encoder similar to that of BERT with an autoregressive decoder, as seen in GPT (Radford et al., 2018). Thereby, contextualized language representations are trained and a model proficient in text generation and translation is created. The BART architecture was subsequently applied to large-scale monolingual corpora across 25 languages, creating multilingual BART (mBART), which can be directly fine-tuned for machine translation (MT).

Dataset
In order to compare to a strong baseline, we train and test on the ACTER dataset (Rigouts Terryn et al., 2019) utilized in the recent TermEval 2020 challenge. The domains wind energy and corruption represent the training set, dressage (equitation) the validation set, and heart failure the hold-out test set. The counts of words and unique gold-standard terms, including named entities, for English, French, and Dutch are presented in Table 1.
In the ACTER dataset, words were labeled as specific, common, and out-of-domain (OOD) terms, and named entities (NE). Specific terms are understood by domain experts, while common terms are also understood by laypersons. OOD terms might be specific to a different domain but are used in the domain at hand, e.g. statistical terms in the medical domain.
Since the time of the challenge, the dataset has undergone some minor updates, namely unicode encoding as well as dash and quote normalization. We believe that these minor normalization changes do not significantly impact comparability to TermEval results, which is confirmed by the fact that our model most similar to the baseline, the sequence classifier, achieves comparable results. Furthermore, the ACTER dataset provides terms as a single list for all documents in a domain. However, we required inline sentence-level term annotation, which we generated. In rare cases, this generation of inline annotations may have led to erroneous results for single-word terms. For instance, the term "gain" as in "private gain" led to the verb "gain" as in "gain acceptance" being erroneously annotated in the corruption domain. We manually analyzed 300 inline-annotated sentences and, since the above example was the only error found, we consider this a negligible issue.
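The inline annotation step can be sketched as a simple term-list-to-sentence matching routine. This is a minimal, hypothetical illustration (our actual pipeline used the flashtext algorithm); `annotate_sentence` is not part of the released code, and the second example reproduces exactly the kind of false match described above:

```python
import re

def annotate_sentence(sentence, terms):
    """Mark every occurrence of a gold-standard term in a sentence.

    Returns a sorted list of (start, end, term) character spans. Longer
    terms are matched first, so e.g. "private gain" wins over "gain".
    """
    spans = []
    taken = [False] * len(sentence)  # characters already claimed by a term
    for term in sorted(terms, key=len, reverse=True):
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", sentence):
            if not any(taken[m.start():m.end()]):
                spans.append((m.start(), m.end(), term))
                for i in range(m.start(), m.end()):
                    taken[i] = True
    return sorted(spans)
```

Note that purely string-based matching cannot distinguish parts of speech: the verb "gain" still matches the single-word term "gain", which is the error mode we observed.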
The fully inline annotated ACL RD-TEC 2.0 (henceforth ACLR2) dataset provides cleaner training and test data and could therefore potentially further boost model performance, as we show in Section 7. Previous work on this dataset reports varying metrics (Zhang et al., 2018b), such as F1 on Recoverable True Positives (F1@RTP) (Zhang et al., 2018a), due to the necessity to define an arbitrary cut-off point with traditional ATE methods. In another work attempting ATE with neural networks, due to the lack of an official data split and a restriction to domain-specific terms, F1 scores are reported on arbitrary parts of the dataset (Kucza et al., 2018).

Neural Language Model-based ATE
We introduce two possible architectures for ATE based on the multilingual language model XLM-R. For the experiments we use the base-size model in the form of the implementation made available by the transformers library (Wolf et al., 2019).

Sequence Classifier
As with the winning approach of TermEval 2020 (Hazem et al., 2020), our first architecture utilizes language models for binary sequence classification, using a fully connected layer to classify the representation of the special classification token <s>, which, encoded by XLM-R, carries information about the whole input sequence. Instead of using language-specific models, however, we make use of the multilingual model XLM-R, which enables the use of a single model for all languages and the ability to generalize to unseen languages. The model receives as input pairs consisting of a term candidate and a context sentence in which the candidate appears, as exemplified in Table 2. Term candidates are created by producing all possible n-grams of a given sentence. For performance reasons and given the term length distribution in the dataset (mostly <5 words), n-grams were only created up to a length of 6 words. For instance, given the input sentence "We meta-analyzed mortality using random-effect models", a positive sample, i.e., one labeled as term, is "random-effect models. We meta-analyzed mortality using random-effect models", while a negative sample is "mortality using. We meta-analyzed mortality using random-effect models". For training the model, we undersample the negative samples so that their number matches the number of positive samples, to remain comparable to Hazem et al. (2020). For the evaluation on the validation and test sets we use all possible n-grams for each input sentence, thus creating a set of extracted terms which we can evaluate against the gold standard. The model was trained for 4 epochs with a batch size of 32 using the Adam optimizer with a learning rate of 2e-5.
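The candidate generation and undersampling described above can be sketched as follows. This is a minimal illustration under the paper's setup; the function names are ours, not from the released code:

```python
import random

def candidate_pairs(sentence, gold_terms, max_len=6):
    """Generate (candidate + context, label) samples from one sentence.

    Every n-gram up to max_len words becomes a candidate; a sample is
    labeled 1 if the candidate matches a gold-standard term, else 0.
    """
    words = sentence.rstrip(".").split()
    samples = []
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            cand = " ".join(words[i:i + n])
            # Input format: "<candidate>. <context sentence>"
            samples.append((f"{cand}. {sentence}", int(cand in gold_terms)))
    return samples

def undersample(samples, seed=0):
    """Downsample negatives to match the number of positives."""
    pos = [s for s in samples if s[1] == 1]
    neg = [s for s in samples if s[1] == 0]
    rng = random.Random(seed)
    return pos + rng.sample(neg, min(len(neg), len(pos)))
```

At evaluation time, all n-grams are kept (no undersampling), and the candidates classified as terms form the extracted term set.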

Token Classifier
The second architecture classifies each token of an input sentence separately, utilizing the same fully connected layer for all tokens after they have been processed by XLM-R. This leads to a significant reduction in training and inference time, as each sentence only has to be processed once by XLM-R. This type of architecture is usually employed in tasks like Named Entity Recognition (NER) (Devlin et al., 2019), where each word of a sequence needs to be classified.
The input provided to the model simply consists of the sentences of the document we want to process. The model then assigns each input token one of three possible output labels: "B-T" for the beginning of a term, "T" for the continuation of a term, and "n" if the token is not part of a term. For instance, the input sentence "We meta-analyzed mortality using random-effect models." would be labeled as 'n', 'B-T', 'B-T', 'n', 'B-T', 'T', 'n', with the last label annotating the punctuation at the end of the sentence. Table 2 compares this input and output pattern with the other two methods. Since XLM-R's tokenizer is a SentencePiece tokenizer that splits the input into subword tokens, the labels predicted by the model apply to subwords and have to be matched back to the original words of the sentence afterwards. For training we used the Adam optimizer with a learning rate of 2e-5. Moreover, we used a batch size of 8, evaluating the model every 100 steps so that the best model could be loaded at the end.
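Producing the word-level training labels from the inline term annotation can be sketched as follows; the subword-to-word alignment is omitted here, and `label_words` is a hypothetical helper, not the released implementation:

```python
def label_words(words, terms):
    """Assign "B-T"/"T"/"n" labels to a tokenized sentence.

    `terms` is a set of term strings; longer matches take precedence so
    that "random-effect models" is labeled as one term, not two.
    """
    labels = ["n"] * len(words)
    i = 0
    while i < len(words):
        for n in range(len(words) - i, 0, -1):  # try longest span first
            if " ".join(words[i:i + n]) in terms:
                labels[i] = "B-T"
                for j in range(i + 1, i + n):
                    labels[j] = "T"
                i += n
                break
        else:
            i += 1  # no term starts here
    return labels
```

After prediction, each word inherits the label of its subwords, mirroring the standard NER fine-tuning setup.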

NMT-based ATE
As a third experiment, we present a novel approach to ATE building on a recent sequence-to-sequence denoising autoencoder model trained for NMT. We chose the recent and robust mBART model trained on the Common Crawl corpus in 25 languages (mBART25), available in the Fairseq library.

Data Preprocessing for NMT-based ATE
Since we cast the downstream task of ATE as an MT task, we required parallel text data for supervised fine-tuning of mBART. We opted for a sentence-level approach, which specifically requires sentence-aligned parallel data. Sentence tokenization was performed with the Punkt tokenizer of NLTK, and terms were inline annotated with the flashtext algorithm (Singh, 2017). For the ACLR2 dataset, individual sentences and the terms within were extracted with an XML parser. In order to distinguish single- and multi-word terms in the model's output sequence, a separator between terms or a unifying character between the components of multi-word terms was required. Preliminary testing showed that using a semicolon surrounded by white-spaces (" ; ") as separator achieves the same final F1 score as more complex separators such as a tag (for example <term>). Notably, using an underscore to connect the individual constituents of a multi-word term lowered the score significantly: F1 performance of the best model was 5.3% lower on average across all test languages compared to utilizing semicolons. Irrespective of the separator, the model would at times add or omit a white-space between separator and term, with the effect that the term would not be considered in the evaluation. This was remedied in the process of extracting individual terms from the output sequence, and the results reported in Section 7 are with unwanted white-spaces removed. Tokenization during training was performed with SentencePiece (Kudo and Richardson, 2018), and the data was binarized with the fairseq-preprocess CLI tool.

Table 2: Input and output examples for the three models.

Model               | Input Example                                                                | Output Example
Sequence Classifier | random-effect models. We meta-analyzed mortality using random-effect models | Term
Token Classifier    | We meta-analyzed mortality using random-effect models                        | n, B-T, B-T, n, B-T, T
NMT Model           | We meta-analyzed mortality using random-effect models                        | meta-analyzed ; mortality ; random-effect models
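The construction of decoder targets and the clean-up of decoded output around the separator can be sketched as follows (a minimal, hypothetical illustration, not the released preprocessing code):

```python
def to_target(terms):
    """Build the decoder target: terms joined by the " ; " separator."""
    return " ; ".join(terms)

def parse_output(decoded):
    """Recover the term list from a decoded output sequence.

    The model sometimes drops or adds white-space around the separator,
    so we split on the bare ";" and strip each term rather than splitting
    on the literal " ; " string.
    """
    return [t.strip() for t in decoded.split(";") if t.strip()]
```

Splitting on the bare semicolon is what makes the evaluation robust to the white-space variations described above.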

NMT Model Fine-Tuning
The pretrained mBART model was fine-tuned with the preprocessed data described in Section 6.1. The input to the encoder was a given sentence, such as "Codes of conduct forbid corruption, irrespective of its intended purpose.", while the decoder was shown the expected term labels, such as "codes of conduct ; corruption". No language-specific tags were added to input or output; Table 2 compares this to the other methods. For faster and more memory-efficient training we used automated mixed-precision training in Fairseq with the Fused Adam optimizer of the NVIDIA Apex PyTorch extensions. We fine-tuned a separate model for each language of the dataset and a single model with all languages combined. Following the original publication of the pretrained model, each model was fine-tuned with 0.3 dropout, 0.2 label smoothing, 2500 warm-up steps, and a learning rate of 3e-5. Furthermore, we opted for a dynamic batch size by limiting the maximum tokens per batch to 768, while updating the gradients every 4 steps (more details in Section 7.4).
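A fine-tuning invocation of this kind can be sketched with the fairseq CLI. Flag names follow the public fairseq mBART recipe; paths, language codes, and any values not stated above are placeholders, not our exact configuration:

```shell
# Sketch of mBART fine-tuning with fairseq; data-bin, the checkpoint path,
# and $LANGS (the 25 mBART25 language codes) are placeholders.
fairseq-train data-bin \
  --restore-file mbart25/model.pt \
  --arch mbart_large --task translation_from_pretrained_bart \
  --langs $LANGS --source-lang src --target-lang tgt \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
  --dropout 0.3 --optimizer adam --lr-scheduler polynomial_decay \
  --lr 3e-05 --warmup-updates 2500 \
  --max-tokens 768 --update-freq 4 --fp16
```

The `--max-tokens 768 --update-freq 4` pair realizes the dynamic batch size with gradient accumulation every 4 steps described above.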
While preliminary testing showed faster convergence and slightly higher final scores with more tokens per batch, availability of the V100 GPU was not guaranteed, so the training hyperparameters had to be adjusted to also run on an RTX 2080 Ti GPU, which limited the maximum tokens per GPU to 768. Model performance was evaluated at regular intervals during training.

Results
This section first presents the results on ACTER, including an analysis per language (combination), and the results on ACLR2, then details the term-length and term-type behavior of the models, and finally compares their training time efficiency. We additionally report the validation performance of the best performing token classifier in Table 4, which shows some performance differences compared to the test domain, especially with French as training and validation language. For further comparability, we also provide precision, recall, and F1 scores at k top terms for 15 methods offered by the term extraction toolkit ATR4S, which implements a large range of existing ATE methods, in Appendix A.

Results on ACTER
To compare our results to the strongest participant of TermEval 2020, we report precision, recall and F1 scores in Table 3. These metrics are calculated on the basis of the available annotation in the original ACTER dataset, where we opted for the more comprehensive list of terms including named entities. All three models are evaluated on different combinations of training and test languages as shown in Table 3, where the heart failure domain is the hold-out test set as done for the SOTA baseline. The overall best results are marked in bold for each test language, while the best results of each model (if not bold) are highlighted in italics.
The overall best result for our approaches was an F1 score of 69.8%, achieved by training the token classifier on English and testing it on Dutch. With the exact same settings as the baseline (Hazem et al., 2020), which is based on RoBERTa (Liu et al., 2019), for English as training and test language, the token classifier achieves an 11.6% higher F1 score and the NMT model an improvement of 6.5%. The sequence classifier struggles with precision and cannot outperform the baseline in this setting. The best performance for English as test language is achieved by the token classifier trained on Dutch and by the NMT model trained on all languages. When testing on French, the sequence classifier is on par with the F1 baseline (Hazem et al., 2020) building on CamemBERT (Martin et al., 2020), while the token classifier outperforms it by 9.5% and the NMT model obtains an additional 7.8%. The best performance on French as test language is again achieved by the token classifier when trained on English and by the NMT model when trained on all languages. The baseline for Dutch is provided by a bidirectional LSTM with GloVe embeddings; no system description paper was submitted for this approach after participation in the challenge. With Dutch as test language, the sequence and token classifiers achieve their best results when trained on English, and the NMT model when trained on Dutch.
A significant result is the substantial improvement in precision of the token classifier and the NMT model over the baseline, even though recall for English as test language lags behind. For French, recall could be improved with the NMT model and matched by the token classifier when trained on English. Interestingly, the sequence classifier achieves a remarkable improvement in recall but lags behind in precision in all settings. This can be explained by the fact that we undersample the negative samples to match the number of positive samples, a strategy adopted from Hazem et al. (2020) to obtain comparable results. If undersampling is reduced, precision and recall become more balanced and closer to the performance of the token classifier; however, training time increases considerably. Another reason for the higher number of phrases extracted by the sequence classifier compared to the other models is that it can separately extract multi-word terms as well as words which are part of these multi-word terms, since both are used as input in the form of potential term candidate n-grams.
All three models show remarkable zero-shot transfer learning capabilities, i.e., they are trained on one language and show strong test scores on another. This is especially true for the token classifier, where models trained on a single language often outperform those trained on all three languages. This transfer learning ability across languages can also be observed in the overall highest F1 scores for the English test set, which was achieved by a model trained on Dutch, and for the French test set, which was achieved by a model trained on English.

Results on ACLR2
In addition to evaluating our models on the ACTER dataset, we compared the two best performing architectures, i.e., the token classifier and the NMT model, on the ACLR2 dataset. Both models achieve similar test scores, as reported in Table 5, and higher scores than on the ACTER dataset. As with the ACTER dataset, we additionally report the validation performance of the best performing token classifier model in Table 4.

Term-based Analysis
A qualitative analysis of the lists of false positives and false negatives based on the ACTER dataset demonstrated that all models handle acronyms well. This may be due to the text type in ACTER, which is partially based on scientific abstracts that frequently introduce acronyms in brackets. If an acronym is part of the term, e.g. "LV strain rate", there was a high number of false negatives in both models. Moreover, false negatives occurred in all models if a term included a proper name and an apostrophe, e.g. "Chaga's disease" or "Cronbach's α", or frequently if it included a figure, e.g. "p38alpha" or "6-min walk test". In addition, named entities that included version numbers or consisted of multiple words often resulted in false negatives, e.g. "Self-Care of Heart Failure Index Version 6.2" or "Multicenter Automatic Defibrillator Implantation Trial-Cardiac Resynchronization Therapy". For the token classifier and the NMT model, named entities referring to cities, e.g. "New York" and "Seattle", were frequently not identified as terms. False negatives also occurred in all models for particularly long multi-word terms, e.g. "resynchronization reverses remodeling in systolic left ventricular dysfunction". A tendency of the token classifier to split longer terms could be observed, e.g. splitting adjectives and nouns.
To quantitatively evaluate how well the different model types handle terms of different lengths, we computed F1 scores individually for terms of a specific length, based on the terms in the ACTER test set. The results in Table 6 were computed using the best model of each method, i.e., the model trained on English for the token and sequence classifiers and the model trained on all languages for the NMT model. First, the scores of all models decrease with term length. Second, we observe that for English and Dutch the token classifier has the strongest results for all term lengths. For French, however, the token classifier's scores decrease strongly for multi-word terms, even though it is still the best model for unigrams. This is due to a very low recall; e.g., for 4-grams and longer, the token classifier recalls only 7% of all French terms. The NMT model shows more consistency between languages and thus performs strongest for French multi-word terms. As with the overall scores, the sequence classifier shows the highest recall values for both single-word and multi-word terms but lags behind in precision, which leads to an overall lower F1 score.
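This per-length evaluation can be sketched as a bucketed precision/recall/F1 computation over predicted and gold term sets (a hypothetical helper, not our exact evaluation script):

```python
from collections import defaultdict

def f1_by_length(predicted, gold):
    """F1 of extracted vs. gold terms, bucketed by word count.

    `predicted` and `gold` are sets of term strings; returns a dict
    mapping term length in words to the F1 score for that bucket.
    """
    buckets = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for t in predicted:
        buckets[len(t.split())]["tp" if t in gold else "fp"] += 1
    for t in gold - predicted:
        buckets[len(t.split())]["fn"] += 1
    scores = {}
    for n, c in buckets.items():
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        scores[n] = 2 * p * r / (p + r) if p + r else 0.0
    return scores
```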
Furthermore, based on the ACTER term type annotation (see Section 4), we could compare the types of terms extracted by the individual models. As can be seen in Fig. 1, the models all achieve a very similar distribution of extracted term types when compared to the gold test set distribution. We can observe, however, that the sequence classifier showed a slight tendency to extract more common and OOD terms and noticeably less NEs than the other models. All models tended to extract more specific terms, with the token classifier and the NMT model interestingly extracting comparatively few OOD terms.

Training Time Efficiency
Looking at the epochs required to reach the best score on the ACTER validation set, we can observe that in most cases the token classifier requires less than a single training epoch. The token classifier models trained on the ACLR2 dataset need more epochs, achieving their highest scores after 3 and 5 epochs, respectively. However, due to the smaller training set of the ACLR2 corpus, this also corresponds to fewer than 500 steps and is thus similar to the training times reported for the models trained on the ACTER data. In comparison, the sequence classifier achieved its best performance on the ACTER validation set after 4 epochs of training. The NMT model also required several epochs to reach its best performance. Initially, all models were trained for 80 epochs, with the model having the lowest validation loss being loaded at the end. The models trained on monolingual data benefited from longer training compared to the models trained on the combined multilingual data. For completeness, we report the training epochs, label-smoothed cross-entropy loss, and log perplexity on the validation set for the best models. For the English dataset, the reported score was achieved at epoch 49 with a loss of 5.82 and a perplexity of 3.94. For the French dataset, peak performance was reached at epoch 40, with a loss of 5.82 and a perplexity of 3.78. Like the French model, the Dutch model achieved its best performance at epoch 40. Training and validation times are compared in Table 7, where training time denotes the full training time over all epochs without any validation, and validation time denotes the time for a single validation run. The token classifier is the most efficient.

Discussion
Although the ACLR2 dataset is smaller than the ACTER dataset, the resulting F1 scores are considerably higher. Apart from the fact that it only covers a single domain, ACLR2 already provides inline annotations and more consistent term annotations, which seems to facilitate learning the task. Inconsistencies in the ACTER annotations were mainly noted when analyzing false positives of the models. For instance, "patient" is considered a common term in the heart-failure domain, but "serum" is not annotated at all, although in our view it would also qualify as a common term.
We also noted that more training data does not necessarily increase model performance. As indicated by the training times on the ACTER dataset, the token classifier achieved its best evaluation scores long before training for a whole epoch, i.e., having seen only a small fraction of the available data before reaching its strongest performance.
In this paper we compare the performance of a pretrained monolingual language model baseline with pretrained multilingual language models. Previous work indicates that monolingual language models like RoBERTa or CamemBERT outperform multilingual language models on tasks posed in a single language (Rönnqvist et al., 2019). The difference increases with the complexity of the given task but is negligible on simple tasks that mostly rely on syntactic features. Since in our case the multilingual model XLM-R in the form of a sequence classifier performs very similarly to the sequence-classifier-based RoBERTa model that won TermEval 2020, this indicates that successful ATE does not require very strong language understanding but corresponds more to simpler tasks relying mostly on syntactic features. Nevertheless, the remarkable zero-shot transfer learning of the multilingual models fine-tuned on a single language also suggests that multilingual pretraining might aid the model in defining what a term is, as highly domain-specific terms might be similar between the languages tested, e.g. rooted in Latin. In the NMT output analysis, we found that this knowledge transfer between languages can cause curious side-effects, where terms are at times predicted by the model in a semi-translated way. For instance, when trained on English, the model would at times invent "toxicity cardiaque" for the French test set instead of extracting "toxicité cardiaque".
Besides stronger performance, the NMT model as well as the token classifier have a higher potential to handle a possible extension of the term extraction task to discontinuous entities, which, however, are so far not annotated in the datasets we used. An example of a discontinuous entity can be found in the expression "left and right ventricular failure", where "right ventricular failure" but also "left ventricular failure" are terms, the latter not being contiguous in the original expression. While the NMT model does not require any special adaptations to deal with such an addition, the sequence classifier would have to consider many more n-gram combinations, again leading to even higher training and inference times per sentence. To consider discontinuous entities with the token classifier's labels, the annotation and training process would have to be adapted to multi-label token classification; e.g., the above phrase would be labeled as [B-T, n, n, T, T] and [n, n, B-T, T, T]. Since in the first label sequence "ventricular" and "failure" are labeled as "T", they still clearly belong to the word "left" labeled as "B-T", which could be resolved in a post-processing step.
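Under the multi-label scheme sketched above, discontinuous terms could be recovered from the label sequences with a simple decoding step. The following is a minimal illustration of this hypothetical post-processing; `decode_multilabel` is not part of the released code:

```python
def decode_multilabel(words, label_seqs):
    """Recover (possibly discontinuous) terms from per-layer label sequences.

    Each sequence in `label_seqs` labels every word with "B-T", "T", or "n";
    a term is the "B-T" word concatenated with the "T" words that follow it,
    skipping any "n" words in between.
    """
    terms = []
    for labels in label_seqs:
        current = []
        for word, lab in zip(words, labels):
            if lab == "B-T":
                if current:  # a previous term ends here
                    terms.append(" ".join(current))
                current = [word]
            elif lab == "T" and current:
                current.append(word)
        if current:
            terms.append(" ".join(current))
    return terms
```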

Conclusion
In this paper, we adapt and evaluate three transformer-based models on the task of ATE, building on pretrained multilingual language and NMT models. In this evaluation, the multilingual models outperform a baseline of monolingual language models and show remarkable zero-shot abilities. A token classification strategy building on a language model achieved the best performance; however, the NMT-based model handled multi-word expressions more consistently across languages and did not lag far behind in performance. One aspect that became very clear is a preference for quality over quantity when fine-tuning pretrained models for ATE.
Recently, both NMT and masked language models have shown a trend towards increased input sequence capacity. It would therefore be interesting to evaluate the impact of context length on the proposed models by testing with more domain context than single sentences. Furthermore, to test the ability of the token classifier and the NMT model to handle discontinuous terms, such as elliptical expressions, a dataset annotating such terms would be needed.

Acknowledgments
The Text2TCS

Impact Statement
Automatically extracting domain-specific terms across domains and languages with high accuracy provides a valuable means to reduce time and resource effort in creating terminological resources. Such resources are important to ensure terminological consistency in specialized communication, such as communication between different groups in times of crisis, and to avoid misunderstandings.
From a technological perspective, we introduce multilingual pretrained language models to the field of Automated Term Extraction (ATE), with detailed tests of three different transformer-based models across four domains and three languages. Since these models support considerably more languages than tested, the approach can be transferred to other languages; we verified this transfer capability by training in one language and testing in another. Transfer capabilities extend to domains: we trained and validated on three domains and strongly outperformed previous approaches on a previously unseen test domain. Up to this point, such flexibility has only been achieved by statistical approaches, albeit with considerably lower precision and recall. In contrast to previous ATE methods operating on corpora, our models extract terms on sentence level. This makes ATE more flexible, since neither large domain-specific nor reference corpora are required.
From a societal perspective, terminological inconsistencies are a major source of misunderstanding in the communication among experts, between experts and laypersons, and between laypersons in reference to a specialized domain. This issue can be mitigated by publishing agreed-upon designations for real-world phenomena in a specialized domain that can be consulted for domain-specific communication. However, manually preparing a collection of natural language terms is extremely human-resource- and time-intensive. We reduce this workload for governmental institutions, private and public organizations, and private persons by providing a method to automate the detection of such domain-specific terms in natural language texts across languages and domains.
In terms of risk, such a highly flexible solution to automated term extraction fully depends on the quality of the input text. Misleading, erroneous, or biased content will inevitably be propagated to the resulting terminologies. Relying on terminologies extracted from such problematic content can negatively impact specialized communication or conclusions drawn from it. It is therefore vital for any user of this approach to mitigate uncertainty about the reliability of extracted terms by only considering high-quality and reliable sources in the term extraction process and by having domain experts carefully review the outcome before using it in communication. We cannot guarantee that in a real-life setting all important terms have been extracted and that all extracted terms are indeed central to the domain at hand. Furthermore, training neural network models is a process known to leave an environmental footprint, which we try to mitigate by fine-tuning pretrained models. Fine-tuning is less resource- and time-intensive than training from scratch, but still requires high-performance computing clusters.

A Appendix
For further comparability, we provide results of 15 prior term extraction methods from the ATR4S toolkit (Astrakhantsev, 2018). All methods in ATR4S are re-ranking methods based on a previous term candidate extraction step. Table A1 shows the results of ATR4S on the ACTER heart-failure domain in English. While some methods achieve good precision, most show precision scores below our best models, even at only 100 terms extracted. Increasing the manually specified number k of terms to extract decreases precision in favor of recall, and the scores of the different methods level out towards the maximum of 2,000 terms extracted. The best F1 score, 30.32%, is achieved by the DomainPertinence method at 2,000 terms extracted.

Table A2 shows the results of ATR4S on our ACLR2 test splits. One major drawback of prior methods is the required corpus size: the small ACLR2 test set does not provide enough data for many of the statistical approaches, or indeed for the re-ranking to be effective at all beyond a certain number of extracted terms. For the smaller Annotator 1 test set, we observe virtually identical scores across all methods from 300 extracted terms onwards; for Annotator 2, this phenomenon occurs at 400 extracted terms. The best overall results are an F1 score of 21.83% for Weirdness at 200 terms extracted on the Annotator 1 test set and an F1 score of 18.28% for Weirdness, PU, and DomainPertinence at 300 terms extracted on the Annotator 2 test set. In comparison, our best models achieve an F1 score of over 75% for both annotators.
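The scores at a fixed cutoff k reported in Tables A1 and A2 follow the standard precision/recall-at-k scheme for ranked term lists. A minimal sketch of how such scores can be computed is shown below; the function and variable names are hypothetical, not taken from ATR4S.

```python
def scores_at_k(ranked_candidates, gold_terms, k):
    """Precision, recall, and F1 for the top-k ranked term candidates.

    precision@k = true positives / k
    recall@k    = true positives / |gold|
    F1 is the harmonic mean of the two.
    """
    extracted = set(ranked_candidates[:k])
    gold = set(gold_terms)
    tp = len(extracted & gold)
    precision = tp / k if k else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

With this scheme, raising k can only keep or increase recall while typically lowering precision, which explains the trade-off visible across the rows of Tables A1 and A2.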