XLTime: A Cross-Lingual Knowledge Transfer Framework for Temporal Expression Extraction

Temporal Expression Extraction (TEE) is essential for understanding time in natural language. It has applications in Natural Language Processing (NLP) tasks such as question answering, information retrieval, and causal inference. To date, work in this area has mostly focused on English as there is a scarcity of labeled data for other languages. We propose XLTime, a novel framework for multilingual TEE. XLTime works on top of pre-trained language models and leverages multi-task learning to prompt cross-language knowledge transfer both from English and within the non-English languages. XLTime alleviates problems caused by a shortage of data in the target language. We apply XLTime with different language models and show that it outperforms the previous automatic SOTA methods on French, Spanish, Portuguese, and Basque, by large margins. XLTime also closes the gap considerably on the handcrafted HeidelTime method.


Introduction
Temporal Expression Extraction (TEE) refers to the detection of temporal expressions (such as dates, durations, etc., as shown in Table 1). It is an important NLP task (UzZaman et al., 2013) and has downstream applications in question answering (Choi et al., 2018), information retrieval (Mitra et al., 2018), and causal inference (Feder et al., 2021). Most TEE methods work on English and are rule-based (Strötgen and Gertz, 2013; Zhong et al., 2017). Deep learning-based methods (Chen et al., 2019; Lange et al., 2020) are less common and report results on par with or inferior to the rule-based SOTAs.
Moreover, methods that work on other languages are rare because of the scarcity of annotated data. We find that there is considerable room for improving TEE, especially for low-resource languages; for example, even the previous SOTA performance on the English TE3 dataset (UzZaman et al., 2013) leaves room for improvement. Developing an approach that can learn from the existing limited amount of training data is crucial for this field, given the effort required to develop high-quality rules for each language.

We therefore propose XLTime, a cross-lingual knowledge transfer framework for multilingual TEE. We base our framework on pre-trained multilingual models (Devlin et al., 2019; Conneau et al., 2020) and use Multi-Task Learning (MTL) (Liu et al., 2019a) to prompt knowledge transfer both from English and among the low-resource languages. For this, we design primary and secondary tasks. The primary task leverages the existing annotated TEE data of the other languages; it transfers explicit knowledge about the forms of the temporal expressions in a source language. The secondary task maps the annotated source-language TEE data samples to the target language using machine translation tools, such as Google Translate, and derives sentence-level labels (indicating the presence of one or more temporal expressions) from the original token-level labels, constructing training data in a weakly-supervised manner. The secondary task thus transfers implicit knowledge by teaching the model to detect the presence of temporal expressions in target-language text.

Contributions. 1) We propose XLTime, which prompts cross-lingual knowledge transfer using MTL to address multilingual TEE. 2) We show that XLTime outperforms the previous automatic SOTA methods by large margins on four languages, French (FR), Spanish (ES), Portuguese (PT), and Basque (EU), which are "low-resource" for the TEE task.
3) We show that XLTime also approaches the performance of the heavily handcrafted HeidelTime (Strötgen and Gertz, 2013) and even outperforms it on two languages (Portuguese and Basque). We make our code and data publicly available at https://github.com/YuweiCao-UIC/XLTime.

Related Work
While TEE is an important problem in NLP, there is relatively little work in the area, and most of it focuses on English. Prior art can be divided into two classes: rule/pattern-based and deep learning approaches. In the first class, HeidelTime (Strötgen and Gertz, 2013) is the top-performing approach to date and covers over a dozen languages, driven by a collection of finely-tuned rules. The approach was later extended to more languages with HeidelTime-auto (Strötgen and Gertz, 2015), which leverages language-independent processing and rules. Other approaches include SynTime (Zhong et al., 2017), which is based on heuristic rules, and SUTIME (Chang and Manning, 2012) and PTime (Ding et al., 2019), which leverage pattern learning.
In the second class, Laparra et al. (2018) propose a model based on RNNs, Chen et al. (2019) use BERT with a linear classifier, and Lange et al. (2020) feed mBERT embeddings to a BiLSTM with a CRF layer, outperforming HeidelTime-auto on four languages. However, the reported performances of the deep learning-based methods are inferior to those of the rule-based ones, in part due to the complexity of the problem and the paucity of training data. In our work, we propose a new model that not only outperforms prior deep learning methods but also closes the gap considerably on HeidelTime, despite the data issues.
In addition, we are aware that applying label projection methods (Jain et al., 2019) can be a straightforward way to address the data scarcity in non-English TEE. TMP (Jain et al., 2019), originally proposed for cross-lingual named entity recognition (NER) (Lample et al., 2016), projects English data in IOB (Inside-Outside-Beginning) tagging format (Ramshaw and Marcus, 1999) to that of the other languages using machine translation, orthographic, and phonetic similarity packages. We show that the proposed XLTime, specifically designed to transfer temporal knowledge between languages, outperforms TMP by large margins.

Proposed Method
We formalize TEE as a sequence labeling task, similar to NER (Lample et al., 2016). The architecture is shown in Figure 1.

Pre-trained Multilingual Backbone
XLTime adopts SOTA multilingual models, i.e., mBERT (Devlin et al., 2019) and XLMR (Conneau et al., 2020), as the backbone. The pre-trained backbone contains lexicon and Transformer encoder layers, as shown in Figure 1(a). The backbone allows XLTime to acquire semantic and syntactic knowledge of various languages, and it is shared by the MTL tasks introduced in Section 3.2.

MTL-based Cross-Lingual Knowledge Transfer
XLTime transfers knowledge from multiple source languages to the low-resource target language. The source languages include English and any other languages for which TEE training data is available. We design primary and secondary tasks on top of the backbone to prompt explicit and implicit knowledge transfer.

The primary task transfers knowledge that explicitly encodes the forms of the temporal expressions in a source language. It is formalized as sequence labeling and directly leverages the training data of the source language to train the backbone along with the primary task classifier, as shown in Figure 1(a). The primary task minimizes L_sl:

L_sl = -(1/b) Σ_{i=1}^{b} Σ_{j=1}^{m_i} 1(c_ij, y_ij),

where b is the total number of input sequences and m_i is the length of the i-th sequence. x_ij ∈ R^d, output by the backbone, is the embedding of the j-th token in the i-th sequence, and d is its dimension. c_ij = argmax(W · x_ij) and y_ij are the predicted and ground-truth labels of the token. W ∈ R^{|c|×d} is the parameter matrix of the primary task classifier, and |c| is the total number of unique ground-truth labels. 1(·,·) is 1 if its two arguments are equal and 0 otherwise.

The secondary task implicitly reveals how the temporal expressions would be expressed in the target language. We translate the sequences in the source-language training data into the target language using Google Translate (we observe similar results with AWS Translate). The secondary task is formalized as binary classification, where the input samples are the translated sequences and the labels are sentence-level indicators of whether or not the sequences contain temporal expressions (which can be easily inferred from the original token-level labels). This task tunes the model to learn the characteristics of temporal expressions in the target language in an implicit manner; it is weakly-supervised and requires no token-level labeling. It trains the backbone and the secondary task classifier by minimizing L_bc:

L_bc = -(1/b) Σ_{i=1}^{b} 1(c'_i, y'_i),

where x'_i ∈ R^d is the sequence embedding output at the [CLS] position of the backbone.
W' ∈ R^{2×d} is the parameter matrix of the secondary task classifier. c'_i = argmax(W' · x'_i) and y'_i are the predicted and ground-truth labels of the i-th sequence. We train XLTime concurrently on the primary and secondary tasks (further details in Appendix B).

An Illustrative Example. In Figure 1, Primary task-EN2FR and Secondary task-EN2FR transfer knowledge from English to French. Primary task-EN2FR reveals the exact forms of English temporal expressions using token-level labels (Y11 and Y12). Secondary task-EN2FR takes the French translations (X41 and X42) of X11 and X12 as input; Y41 and Y42 indicate whether the sequences contain temporal expressions (inferred from Y11 and Y12). Secondary task-EN2FR thus provides indirect knowledge about French temporal expressions. Similarly, Primary task-ES2FR and Secondary task-ES2FR transfer from Spanish to French.
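The two task heads and the alternating MTL updates can be sketched as follows. This is a minimal, self-contained illustration, not the authors' implementation: it uses a toy Transformer encoder in place of mBERT/XLM-R, toy dimensions, and standard cross-entropy as the differentiable training loss standing in for the label-agreement objectives above.

```python
import torch
import torch.nn as nn

# Toy sizes; in the paper d is 768 (base models) or 1024 (large models).
d, num_tags, vocab = 32, 5, 100

class ToyBackbone(nn.Module):
    """Stand-in for the shared pre-trained multilingual backbone."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, ids):             # ids: (batch, seq_len)
        return self.enc(self.emb(ids))  # (batch, seq_len, d)

backbone = ToyBackbone()
primary_head = nn.Linear(d, num_tags)   # token-level tag classifier (W)
secondary_head = nn.Linear(d, 2)        # sentence-level classifier (W')
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.AdamW(
    [*backbone.parameters(), *primary_head.parameters(),
     *secondary_head.parameters()], lr=7e-6)

def sentence_label(iob_tags):
    """Weak sentence-level label: does the sequence contain any TE?"""
    return int(any(t != "O" for t in iob_tags))

def primary_loss(ids, tag_labels):
    """Sequence labeling on source-language data (explicit transfer)."""
    logits = primary_head(backbone(ids))            # (b, m, |c|)
    return loss_fn(logits.reshape(-1, num_tags), tag_labels.reshape(-1))

def secondary_loss(ids, sent_labels):
    """Binary 'contains a TE?' classification on translated sequences
    (implicit transfer); the first position serves as a [CLS] summary."""
    x_cls = backbone(ids)[:, 0, :]                  # (b, d)
    return loss_fn(secondary_head(x_cls), sent_labels)

# Alternating mini-batch training: each step draws one batch from one
# task and updates the shared backbone plus that task's own classifier.
batches = [
    ("primary", torch.randint(vocab, (2, 6)), torch.randint(num_tags, (2, 6))),
    ("secondary", torch.randint(vocab, (2, 6)), torch.randint(2, (2,))),
]
for task, ids, labels in batches:
    loss = primary_loss(ids, labels) if task == "primary" \
        else secondary_loss(ids, labels)
    opt.zero_grad(); loss.backward(); opt.step()
```

Note that only the backbone is shared: each optimizer step touches one task head, leaving the other head's parameters unchanged, which matches the concurrent training scheme described above.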

Experiments
This section evaluates the proposed XLTime framework. Section 4.1 introduces the datasets, models evaluated, metrics, and experimental settings. Section 4.2 quantitatively shows how XLTime alleviates data scarcity and improves TEE performance. Section 4.3 studies the effect of transferring knowledge from other languages in addition to English. We also qualitatively show how XLTime transfers knowledge to the target languages in an error analysis in Appendix E.

Experimental Setup
Datasets. We use the English (EN), French (FR), Spanish (ES), Portuguese (PT), and Basque (EU) TEE benchmark datasets. Table 2 shows dataset statistics. For each target language, we split its dataset with 10% for validation and 90% for test. For each source language (applicable to XLTime), we use the whole dataset for training.

Baselines. We evaluate against rule-based, deep learning-based, and entity projection-based methods. We compare to the handcrafted HeidelTime (Strötgen and Gertz, 2013) and its automatically extended version, HeidelTime-auto (Strötgen and Gertz, 2015). We also compare to deep learning methods: BiLSTM+CRF (Lange et al., 2020), mBERT, and the base and large versions of XLMR. In addition, we compare to TMP (Jain et al., 2019), a cross-lingual label projection method which relies on machine translation as well as orthographic and phonetic similarity packages (unavailable for EU).
Our Approaches. We test several variants of our proposed model, which can be broken into two classes: 1) Cross-lingual transfer from EN. We apply XLTime on mBERT, base and large versions of XLMR and use EN as the only source language.
2) Cross-lingual transfer from EN and others. We transfer from other languages in addition to EN.

Evaluation Metrics. We report F1 in strict match (UzZaman et al., 2013), i.e., all of its tokens must be correctly recognized for an expression to be counted as correctly extracted. We follow the setting in prior work of evaluating "without type" and report the results without considering the types of the temporal expressions (e.g., for 'see you tomorrow', a prediction such as 'O O B-Duration' would be counted as correct, though the proper labeling would be 'O O B-Date'). We do note that the field should ultimately evaluate on the more complex task of identifying temporal expressions as well as their types, in the spirit of the annotations and in line with other sequence labeling tasks, such as NER. Therefore, we also experiment with the "with type" setting and show the results in Appendix C; in both settings, the observations made in Sections 4.2 and 4.3 hold and XLTime outperforms the previous SOTAs by large margins.

Experimental Setting. We set d, the embedding dimension, to be consistent with the pre-trained multilingual backbone's dimension (768 for the base versions of the language models and 1024 for the large versions). We use AdamW (Loshchilov and Hutter, 2019) with a learning rate of 7e-6 and a warm-up proportion of 0.1. We train the models for 50 epochs and use the best model as indicated by the validation set for prediction. All datasets are transformed into IOB2 format to fit the sequence labeling setting. All the deep learning methods are trained on English TEE datasets, then validated and evaluated on the low-resource languages. For BiLSTM+CRF, we use the hyperparameters suggested in the original paper (Lange et al., 2020). For TMP, we use it to project the English dataset to the target languages, take the projected data to train the language models, then validate and evaluate on the target languages. We perform a grid search over {0.05, 0.1, 0.15, 0.25, 0.5} to tune δ, the similarity score threshold of TMP, and present the best performance. We repeat all experiments 5 times and report the mean result.

All experiments are conducted on a 64-core Intel Xeon E5-2680 v4 @ 2.40GHz CPU with 512GB RAM and one NVIDIA Tesla P100-PCIE GPU.

Multilingual TEE

We evaluate XLTime on multilingual TEE (see Table 3 and Appendix D). We observe: 1) XLTime-XLMRlarge outperforms the strongest automatic baseline by up to 9% in F1 on all languages. It even outperforms the handcrafted HeidelTime method by a sizable margin (24% in F1) on PT. 2) Applying XLTime improves upon the vanilla language models, even when transferring knowledge only from EN; e.g., XLTime-XLMRbase outperforms XLMR-base by 13%, 22%, 8%, and 54% in F1 on FR, ES, PT, and EU, respectively. 3) Introducing additional source languages to XLTime further improves the performance: F1 improves by up to 19%, 11%, and 11% for XLTime-mBERT, XLTime-XLMRbase, and XLTime-XLMRlarge, respectively. 4) HeidelTime is a very hard baseline to beat given the time and care that went into developing language-specific rules. However, XLTime approaches its performance for FR and ES, outperforms it for PT, and makes predictions for EU (where HeidelTime has no rules). Note that the previous automatic SOTA, XLMR-large, also outperforms HeidelTime for PT, but not as significantly. This shows that automatic methods are increasingly promising for the non-English TEE task. 5) XLTime-XLMRlarge improves upon XLMR-large by a large margin (11% in F1) on EU. For FR, ES, and PT, the improvements are smaller. This may be because XLMR-large, compared to mBERT and XLMR-base, is already very knowledgeable (especially in FR, ES, and PT, which are more common than EU), so applying XLTime provides less additional improvement (in contrast, applying XLTime to mBERT and XLMR-base dramatically boosts F1 by 8-54%). 6) TMP performs poorly, probably because falsely projected entities mislead the language models.
Specifically, the token-by-token machine translation and matching process of TMP does not work well for temporal entities, especially when the target-language TEs contain definite articles, prepositions, etc., that have no explicit matches in the source language. E.g., the EN TE 'yesterday morning' can be correctly mapped to the FR TE 'hier matin' ('yesterday' to 'hier' and 'morning' to 'matin') but not to the ES TE 'ayer por la mañana' ('yesterday' to 'ayer' and 'morning' to 'mañana', leaving 'por' and 'la' unmatched).
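The strict-match criterion used in these evaluations can be made concrete with a short sketch. This is our own illustrative implementation of span-level strict matching over IOB2 tags in the "without type" setting, not the official TempEval-3 scorer:

```python
def iob2_spans(tags):
    """Extract (start, end) token spans from an IOB2 tag sequence,
    ignoring the expression type ("without type" setting)."""
    spans, start = [], None
    for i, t in enumerate(tags):
        if t.startswith("B-") or (t.startswith("I-") and start is None):
            if start is not None:        # close the previous span
                spans.append((start, i))
            start = i
        elif t == "O":
            if start is not None:
                spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return set(spans)

def strict_f1(gold_seqs, pred_seqs):
    """Strict-match F1: a predicted expression counts only if all of
    its tokens, i.e., both span boundaries, exactly match a gold span."""
    tp = fp = fn = 0
    for g, p in zip(gold_seqs, pred_seqs):
        gs, ps = iob2_spans(g), iob2_spans(p)
        tp += len(gs & ps)
        fp += len(ps - gs)
        fn += len(gs - ps)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# "see you tomorrow": wrong type but exact boundaries still counts
# as correct in the "without type" setting.
assert strict_f1([["O", "O", "B-Date"]], [["O", "O", "B-Duration"]]) == 1.0

# A partial span (only "two" out of the gold "two hours") is a miss.
assert strict_f1([["O", "B-Duration", "I-Duration"]],
                 [["O", "B-Duration", "O"]]) == 0.0
```

The "with type" setting would additionally require the type suffix of the tags to match before two spans are counted as equal.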

Transfer Knowledge from Additional Languages
We also study the effect of transferring additional knowledge from a low-resource language besides English; see Table 4 and Appendix D. Our assumption is that similar languages (FR, ES, and PT) would help each other (one exception is PT: its published dataset is EN text translated into PT, so we do not expect machine translation to provide additional knowledge). We observe: 1) In most cases, transferring additional knowledge from similar languages (blue cells) does dramatically improve performance (underlined cells), with F1 increasing by up to 13%. 2) In some rare cases, negative knowledge transfer (Wu et al., 2020) occurs, as adding source languages hurts performance (e.g., EN, ES → PT scores lower than EN → PT for XLTime-XLMRbase). We hypothesize this is related to the quality of the datasets and plan to address it in future work.

Conclusion
We propose XLTime for multilingual TEE in low-resource scenarios. It is built on pre-trained language models and leverages MTL to prompt cross-language knowledge transfer. It greatly alleviates the problems caused by the shortage of training data and shows results superior to the previous automatic SOTA methods on four languages. It also approaches the performance of a highly engineered rule-based system.

References

Sen Wu, Hongyang R. Zhang, and Christopher Ré. 2020. Understanding and improving information transfer in multi-task learning. In Proceedings of ICLR 2020.

Xiaoshi Zhong, Aixin Sun, and Erik Cambria. 2017. Time expression analysis and recognition using syntactic token types and general heuristic rules. In Proceedings of ACL 2017, pages 420-429.

A Types of the Temporal Expressions
According to ISO-TimeML (Pustejovsky et al., 2010), the annotation guideline of the TEE datasets, there are four types of temporal expressions: Date, Time, Duration, and Set. Date refers to a calendar date, generally a day or a larger temporal unit; Time refers to a time of day, with granularity smaller than a day; Duration refers to expressions that explicitly describe a period of time; Set refers to a set of regularly recurring times (Pustejovsky et al., 2010).
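To make the four types concrete in the IOB2 sequence-labeling format used throughout the paper, here is a small illustration; the sentences and tags are our own made-up examples, not drawn from the datasets:

```python
# One invented example per ISO-TimeML temporal expression type,
# token-aligned with IOB2 tags.
examples = {
    "Date":     (["She", "arrived", "on", "May", "3"],
                 ["O", "O", "O", "B-Date", "I-Date"]),
    "Time":     (["The", "call", "is", "at", "3", "pm"],
                 ["O", "O", "O", "O", "B-Time", "I-Time"]),
    "Duration": (["He", "slept", "for", "two", "hours"],
                 ["O", "O", "O", "B-Duration", "I-Duration"]),
    "Set":      (["We", "meet", "every", "Monday"],
                 ["O", "O", "B-Set", "I-Set"]),
}

# Sanity checks: tags align one-to-one with tokens, and every I- tag
# continues a span opened by a matching B- or I- tag (valid IOB2).
for toks, tags in examples.values():
    assert len(toks) == len(tags)
    for i, t in enumerate(tags):
        if t.startswith("I-"):
            assert tags[i - 1] in ("B-" + t[2:], t)
```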

B The Training Procedure
We adopt mini-batch-based stochastic gradient descent (SGD) to train XLTime, as shown in Algorithm 1. To train concurrently on the primary and secondary tasks, we split the training data of both tasks into mini-batches and randomly take one mini-batch at each step. We then calculate the loss on that mini-batch and update the parameters of the shared backbone as well as the classifier of the corresponding task type; the classifier of the other task type is unaffected.

C Full Results With and Without Types

Table 7 shows the full results for low-resource language TEE with and without considering the types of the temporal expressions. Note that the superiority of our proposed XLTime over the previous automatic SOTA holds in both settings.

D Full Results with Additional Source Languages

Tables 8 and 9 show the full results for low-resource language TEE with additional source languages.

E Comparative Error Analysis
This section qualitatively shows how the proposed XLTime framework transfers knowledge to the target language. Specifically, we show how the errors made by the vanilla multilingual models are fixed by applying XLTime, and how applying XLTime with source languages beyond English helps fix further errors. We first compare mBERT and XLTime-mBERT (transferring from EN) on FR TEE. Table 5 summarizes cases where mBERT fails while XLTime-mBERT gives correct predictions. XLTime-mBERT learns 'hier (yesterday)', which is not understood by mBERT. It also learns to recognize vague time spans such as 'désormais (from now on)' and 'longtemps (long time)', which mBERT misses. Moreover, compared to mBERT, XLTime-mBERT understands FR grammar better, as it recognizes the roles of definite articles and adjectives, such as 'le (the)' and 'prochain (next)', in TEs. In short, the proposed XLTime framework helps connect the concepts in EN to the corresponding ones in FR.
To show how applying XLTime with extra source languages helps fix more errors, we compare XLTime-mBERT (transferring from EN) and XLTime-mBERT (transferring from EN and ES) on FR TEE. Table 6 summarizes the TEs on which the former fails while the latter gives correct predictions. By leveraging ES as an additional source language, XLTime-mBERT better masters FR grammar. Specifically, it learns to recognize definite articles and prepositions that share similar (e.g., 'le/los') or identical (e.g., 'de' and 'en') forms in ES and FR. It can also better distinguish TEs of different types (e.g., it learns that 'quelques jours (a few days)' is a Duration, instead of a Date). Interestingly, when transferring solely from EN, the model recognizes some extra TEs that are not in the ground truth of the FR dataset. This is due to an inconsistency in data labeling: 'daily' is considered a Set in the EN dataset, while its counterpart 'quotidiens' is overlooked in the FR dataset. The proposed XLTime framework eliminates the need to manually label multiple datasets and can therefore be applied to minimize such label inconsistencies.

F Language Models on English TEE
In our early experiments, we reexamine the language models on English TEE. This section presents the results.

F.2 Evaluation Results
Tables 10, 11, and 12 show the results. We observe: 1) When ignoring the types, the language models are inferior to SynTime on TE3, and on par with or better than the rule-based methods on Wikiwars and Tweets. 2) When considering the types, the language models outperform the previous SOTAs by 11-22%, 18-21%, and 30-41% in F1 on the TE3, Wikiwars, and Tweets datasets, respectively.