Automatic Rule Generation for Time Expression Normalization

The understanding of time expressions includes two sub-tasks: recognition and normalization. In recent years, significant progress has been made in the recognition of time expressions, while research on normalization has lagged behind. Existing SOTA normalization methods rely heavily on rules or grammars designed by experts, which limits their performance on emerging corpora such as social media texts. In this paper, we model time expression normalization as a sequence of operations that construct the normalized temporal value, and we present a novel method called ARTime, which automatically generates normalization rules from training data without expert intervention. Specifically, ARTime automatically captures possible operation sequences from annotated data and generates normalization rules on time expressions with common surface forms. The experimental results show that ARTime significantly surpasses SOTA methods on the Tweets benchmark and achieves results competitive with existing expert-engineered rule methods on the TempEval-3 benchmark.


Introduction
Temporal information plays an important role in natural language. The research community divides the understanding of time expressions into two subtasks: recognition and normalization (UzZaman et al., 2013). The first task is to annotate time expressions in free text, and the second is to annotate the temporal values and types of the recognized time expressions. Recent work (Zhong et al., 2017; Zhong and Cambria, 2018; Ding et al., 2019) has achieved significant improvements on the recognition task compared with classic rule-based or semantic parsing systems, while research on normalization has lagged behind. Normalization methods often rely on expert-designed rules or grammars to model the compositional structure of time expressions; such rules are domain-sensitive and insufficient for covering emerging corpora.
To avoid the performance limitation and the labor cost of manually designing rules for different corpora, we study the problem of automatically generating normalization rules from annotated data. Achieving this goal poses several challenges. Firstly, the surface text forms of natural language expressions are diverse, and the normalized value of a time expression may not directly correspond to its surface text form (e.g., both "May" and "this month" could be normalized to "2021-05"). Secondly, time expressions have rich semantic structures which are not explicitly reflected in their annotations. The implicitness of the semantic structure makes supervised approaches hard to apply to the task of generating normalization rules. Besides, the annotations in practical datasets are noisy, which challenges the robustness of data-driven methods.
To achieve the goal, we regard time expression normalization as a sequence of operations that construct the normalized temporal value of a specific type. We assume that the surface form of a time expression activates the corresponding normalization sequence. A normalization rule is defined as the alignment between a surface form pattern and the activated operation sequence, as demonstrated in Example 1. Section 3 describes operations and normalization rules in detail.
Example 1. The time expression "last October" can be normalized by the rule (Pattern="last MONTH:$1", Type=Instant, Operations=(ToLast[Year], ModifyEnum[$1])), where the type "Instant" indicates that the normalized value should be a date or time instant, the first operation decreases the current value of the year field by 1, and the second operation sets the value of the month field to the "MONTH" variable obtained from the expression (i.e., October).
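As an illustration of Example 1, the following minimal Python sketch represents a rule as a (pattern, type, operations) triple and executes its operations on a base value. The `Rule` class, the `apply_rule` function, the dictionary-based value representation, and the restriction of ModifyEnum to the month field are assumptions made for this sketch, not ARTime's actual implementation.

```python
from dataclasses import dataclass

# Lexicon of month names (an enumerable temporal constant).
MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}


@dataclass(frozen=True)
class Rule:
    pattern: str       # surface form pattern, e.g. "last MONTH:$1"
    value_type: str    # type of the temporal value, e.g. "Instant"
    operations: tuple  # sequence of (operation name, argument) pairs


def apply_rule(rule, base, bindings):
    """Execute the rule's operations on a copy of the base value.

    base is a dict of time fields; bindings maps variables such as "$1"
    to the tokens captured from the expression.
    """
    value = dict(base)
    for name, arg in rule.operations:
        if name == "ToLast":        # decrease the given field by 1
            value[arg] -= 1
        elif name == "ModifyEnum":  # hypothetical: only MONTH is handled here
            value["month"] = MONTHS[bindings[arg].lower()]
        else:
            raise ValueError(f"unsupported operation: {name}")
    return value


rule = Rule("last MONTH:$1", "Instant",
            (("ToLast", "year"), ("ModifyEnum", "$1")))
result = apply_rule(rule, {"year": 2021, "month": 5}, {"$1": "October"})
print(f"{result['year']:04d}-{result['month']:02d}")  # 2020-10
```

With the base value 2021-05, the rule first moves the year back to 2020 and then sets the month to October.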
We name the method for automatically generating normalization rules ARTime. ARTime computes the difference between the base value and the annotated value of an input time expression to capture possible operation sequences, and aligns each captured sequence with the surface form of the time expression to construct candidate rules. It ranks the noisy candidates by their frequency to identify the good rules. When applying the rules for normalization, ARTime attempts to dynamically search for a composition of rules for unmatched expressions to improve the coverage of the generated rules. The whole normalization process relies only on a small pre-defined lexicon of temporal values (e.g., numeric values and time units) and does not need the intervention of human experts.
The remaining sections are organized as follows. The second section summarizes related research work. The third section introduces the representation of temporal values and time expressions in detail. The fourth section describes the framework and main components of ARTime. The fifth section reports the evaluation results of ARTime on two benchmarks. The last section concludes this paper.

Related Work
Understanding time expressions in natural language has long attracted the attention of researchers. The TIDES research program proposed TIMEX (Setzer and Gaizauskas, 2000) and TIMEX2 (Ferro et al., 2005), which are standalone annotation schemes for time expressions with detailed descriptions of temporal values. The TERQAS workshops conceptualized TimeML (Pustejovsky et al., 2010) based on TIMEX and TIMEX2. TimeML became an ISO standard in 2009. Bethard and Parker (2016) pointed out that the classic annotation schemes fail to show the semantic composition structure of time expressions and proposed the Semantically Compositional Annotation of Time Expressions (SCATE). However, applying SCATE to an existing corpus requires manually re-annotating the expressions in a more complex way, and many of the existing SOTA methods cannot handle annotations in SCATE format directly (Laparra et al., 2018).
On the recognition of time expressions, an early study shows that the complexity of time expressions is limited, and that finite state automata or regular expressions can be effective for recognizing them (Hobbs et al., 1997). Mainstream recognition methods can be roughly divided into surface-structure-based methods (Verhagen et al., 2005; Strötgen and Gertz, 2010; Strötgen et al., 2013; Chang and Manning, 2012; Lee et al., 2014; Zhong et al., 2017; Ding et al., 2019) and sequential-tagging-model-based methods (Bethard, 2013; Ning et al., 2018; Zhong and Cambria, 2018). Recent work has achieved significant improvements on recognition. SynTime (Zhong et al., 2017) defines generic but heuristic rules on a group of time-related triggering token types. TOMN (Zhong and Cambria, 2018) uses the token types defined by SynTime instead of the classic BIO-tagging scheme for the CRF model. PTime (Ding et al., 2019) generalizes time expressions in training data to sequential patterns and selects a subset of the patterns for recognition. However, these studies focus only on recognition.
The normalization of time expressions is dominated by methods with expert-designed rules or grammars. HeidelTime (Strötgen and Gertz, 2010; Strötgen et al., 2013, 2014) uses regex rules on time tokens and modifiers to combine recognized tokens and filter ambiguous expressions. SUTime (Chang and Manning, 2012) proposes a 3-layered temporal pattern language. It first extends recognized tokens to strings, then composes and filters the strings to get temporal values. Angeli and Uszkoreit (2013) use an EM-style bootstrapping approach to learn a PCFG parser on pre-defined preterminals. UWTime (Lee et al., 2014) uses a combinatory categorial grammar to parse possible meanings of time expressions. It selects meanings for recognized expressions via a linear classifier with context-dependent features. CogCompTime (Ning et al., 2018) provides a rule-based standalone normalizer conceptually built on Zhao et al. (2012), which achieves the SOTA normalization results on the TempEval-3 dataset (UzZaman et al., 2013). There are also some efforts on understanding event-related expressions. Tissot et al. (2015) analyze time expressions in clinical notes. TweetTime (Tabassum et al., 2016) improves existing methods by establishing an external event knowledge base. According to existing studies (UzZaman et al., 2013; Tabassum et al., 2016), rule engineering can achieve good results on covered expressions but is hard to extend to emerging corpora.
In this paper, we focus on automatically recovering the semantic structure of expressions without any compositional annotations. The latest work on recognition inspired our idea of using surface form patterns to activate normalization rules, and we replace the labor cost of designing rules with automatic rule generation.

Time Expression Normalization as a Sequence of Operations
We model the normalization of a time expression as a sequence of operations defined on time fields, which constructs a temporal value of a specific type. A normalization rule is defined as a triple consisting of a surface form pattern, a type of temporal value, and an operation sequence. The following subsections introduce these concepts.

Time Fields
Time fields can be simply treated as time units with lower and upper bound constraints on their values. Each temporal value can be denoted by a series of non-overlapping fields. For example, ISO 8601 represents a date value in the format "yyyy-MM-DD", where "MM" represents the "month" field with lower bound 1 and upper bound 12.
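The notion of a bounded time field can be sketched as follows. The `TimeField` class and `parse_date` function are hypothetical names; the bounds chosen for the year field and the fixed upper bound of 31 for the day field (ignoring month lengths) are simplifying assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeField:
    name: str
    lower: int  # smallest legal value
    upper: int  # largest legal value

    def contains(self, value):
        return self.lower <= value <= self.upper


# Fields of a date value, ordered from the largest to the smallest granularity.
# The year bounds and the month-independent day bound are assumptions.
DATE_FIELDS = (
    TimeField("year", 1, 9999),
    TimeField("month", 1, 12),
    TimeField("day", 1, 31),
)


def parse_date(text):
    """Split a "yyyy-MM-DD" string into non-overlapping, bounds-checked fields."""
    parts = text.split("-")
    if len(parts) != len(DATE_FIELDS):
        raise ValueError(f"expected {len(DATE_FIELDS)} fields: {text!r}")
    value = {}
    for field, part in zip(DATE_FIELDS, parts):
        number = int(part)
        if not field.contains(number):
            raise ValueError(f"{field.name}={number} is out of bounds")
        value[field.name] = number
    return value


print(parse_date("2021-05-17"))
```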

Operations
ARTime treats the function of a time expression as changing a base temporal value to a target value. The semantics of a time expression is represented by a sequence of operations defined on the temporal fields. We design ten types of operations for ARTime (as listed in Table 1). The operations take 5 kinds of parameters: 1) integer values v, 2) time units u, 3) temporal fields f, 4) enumerable temporal constants e, and 5) approximate references r (i.e., Past, Present and Future). Most of the operations are designed for temporal values of instant type, while ApproxRef and Add are designed for approximate reference values and duration values respectively. Specifically, we use a MakeSet operation to represent the TIMEX3 type "SET". In the execution of operations, we require the operations to be arranged in order. Operations on larger fields should be executed first. Operations on the same field are arranged according to their type: the operations independent of the base (e.g., ModifyVal) should be executed first. The reason for using descending order of granularity is that it corresponds to the way humans understand time fields. For example, the token "day" denotes "dayOfYear" in "the first day in 2021" and "dayOfWeek" in "the first day in this week". Its meaning depends on the larger fields mentioned in the context. Arranging operations according to their type prevents redundant sequences. Example 2 explains why executing some operations later may override the execution results of previous operations.
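The execution order described above (larger fields first, and base-independent operations first within a field) can be sketched as a sort key. The granularity ranking, the tuple encoding of operations, and the treatment of ModifyEnum as base-independent are assumptions for illustration; only ModifyVal is named as base-independent in the text.

```python
# Rank of each field: a smaller rank means a larger granularity (assumed ordering).
GRANULARITY = {"year": 0, "month": 1, "day": 2, "hour": 3}

# Operations whose result does not depend on the base value.
# ModifyVal is taken from the text; ModifyEnum is an assumption.
BASE_INDEPENDENT = {"ModifyVal", "ModifyEnum"}


def order_operations(operations):
    """Sort (name, field, argument) triples into execution order.

    Larger fields come first; on the same field, base-independent
    operations come before base-dependent ones.
    """
    return sorted(
        operations,
        key=lambda op: (GRANULARITY[op[1]], op[0] not in BASE_INDEPENDENT),
    )


unordered = [
    ("Forward", "day", 3),
    ("ToLast", "year", None),
    ("ModifyVal", "day", 1),
    ("ModifyVal", "month", 10),
]
ordered = order_operations(unordered)
for op in ordered:
    print(op)
```

Here the year operation is executed first, then the month operation, then ModifyVal on the day field, and finally Forward on the day field.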

The Surface Form Pattern of Rule
In our design, each rule has a surface form pattern that determines whether it can be applied to an input expression. The pattern in our approach is similar to the sequential pattern in PTime (Ding et al., 2019), which is defined as a sequence consisting of token types and untyped tokens. A token type consists of multiple values, and each value has a corresponding regex to capture its various surface forms. We only use the 6 token types listed in Table 2 for obtaining variable values. The 6 types include 4 kinds of enumerable temporal constants (i.e., the first 4 rows in the table), time units, and inequality modifiers (denoted as "IN_EQ") collected from HeidelTime.
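A minimal sketch of matching a surface form pattern against an expression is given below. The two token types and their regexes are illustrative stand-ins for the lexicon in Table 2, and `compile_pattern` and `match_pattern` are hypothetical helper names.

```python
import re

# Illustrative token types; the real lexicon (Table 2) is not reproduced here.
TOKEN_TYPES = {
    "MONTH": ("january|february|march|april|may|june|july|august|"
              "september|october|november|december"),
    "TIME_UNIT": "years?|months?|weeks?|days?|hours?",
}


def compile_pattern(pattern):
    """Turn a pattern such as "last MONTH:$1" into a compiled regex."""
    parts = []
    for token in pattern.split():
        type_name, separator, variable = token.partition(":$")
        if separator and type_name in TOKEN_TYPES:
            # Typed token: capture its surface form under the variable name.
            parts.append(f"(?P<v{variable}>{TOKEN_TYPES[type_name]})")
        else:
            # Untyped token: must match literally.
            parts.append(re.escape(token))
    return re.compile(r"\s+".join(parts), re.IGNORECASE)


def match_pattern(pattern, expression):
    """Return the variable bindings if the pattern covers the expression, else None."""
    found = compile_pattern(pattern).fullmatch(expression.strip())
    if found is None:
        return None
    return {"$" + name[1:]: text for name, text in found.groupdict().items()}


print(match_pattern("last MONTH:$1", "last October"))  # {'$1': 'October'}
print(match_pattern("last MONTH:$1", "next October"))  # None
```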
In our method, only the tokens referring to temporal values that appear in the operation sequences will be generalized to the corresponding type. For example, the token "day" in the rule (Pattern="several day later", Type=ApproximateReference, Operations=(ApproxRef[FUTURE_REF])) is not generalized to the corresponding type "TIME_UNIT" since the operations do not require a unit variable.

Table 1 (excerpt). Action: ModifyVal[v, f]. Description: Modify the value in f to v.

Framework of ARTime
Figure 1 illustrates the normalization process of ARTime. The pre-processing step is adopted from the corresponding components in PTime. The remaining normalization procedures can be divided into two parts: 1) generating rules (i.e., the left part of Figure 1) and 2) applying the generated rules (i.e., the right part of Figure 1). Since the TimeML standard does not annotate the base value of each time expression, we simply use the document creation time as a substitute when capturing the possible operation sequences. The following sub-sections describe the key techniques in ARTime. Section 4.1 details how to capture possible operation sequences. Section 4.2 describes how to generate rules from the noisy results. Section 4.3 describes how to use the generated rules to normalize input time expressions.

Capturing Possible Operations
By regarding temporal values as vertices and operations as directed edges connecting the base values to the normalized values, the task of inferring possible operation sequences can be formalized as searching for paths on the graph of temporal values, where each path corresponds to a sequence of operations (as demonstrated in Figure 2). The main challenge is that there can be a large number of paths between two values, and not all of them correspond to meaningful expressions in daily communication. Our method is based on the assumption that practical time expressions are low-redundancy sequences, i.e., we prefer direct sequences like (Equal[Day]) ("today") over complex ones of the same meaning such as (ToLast[Week], Forward[37,Day]) ("7 days after a week ago").
We implement the process by a heuristic depth-first search (DFS) algorithm described in Algorithm 1. The main idea is to guide the search process by the difference between the base value and the annotated value. In each iteration, we ensure that the current value V_c and the target value V_t are the same on fields of granularity not less than the iterated field f (line 1). We enumerate a smaller field f' (line 8) and check whether some operation a on field f' corresponds to the difference between V_c and V_t from f to f' (lines 13-14). In the enumeration of a (line 12), we only consider no-redundancy sequences under the partial order introduced in Section 3.3. Specifically, we accelerate the process by requiring that all numeric values appearing in the search results also appear in the input.
Given the time expression T with annotated value V_a and the base time V_b, we obtain possible operation sequences by calling DFS(V_b, V_a, ∞, numVals(T)), where ∞ is a virtual time unit serving as the initialization condition and numVals is the function that collects the numeric values appearing in T.
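The idea of letting the difference between the base value and the annotated value guide the search can be illustrated with a much simplified sketch. It is not Algorithm 1: it enumerates candidate operations field by field rather than running the full DFS, ignores carries between fields (e.g., December to January), supports only four operation types, and keeps Equal only on the finest annotated field. All function names are hypothetical.

```python
from itertools import product

FIELDS = ("year", "month", "day")  # from the largest to the smallest granularity


def field_candidates(field, base, target, nums):
    """Operations on one field that explain the change from base to target."""
    candidates = []
    delta = target - base
    if delta == 0:
        candidates.append(("Equal", field))
    if delta == -1:
        candidates.append(("ToLast", field))
    if delta > 0 and delta in nums:   # numeric values must occur in the expression
        candidates.append(("Forward", delta, field))
    if target in nums:
        candidates.append(("ModifyVal", target, field))
    return candidates


def candidate_sequences(base, target, nums):
    """Enumerate operation sequences turning base into target.

    target may be coarser than base (e.g., only a year); its fields are
    assumed to form a prefix of FIELDS.
    """
    fields = [f for f in FIELDS if f in target]
    if not fields:
        return []
    per_field = []
    for f in fields:
        candidates = field_candidates(f, base[f], target[f], nums)
        if not candidates:
            return []  # the difference on this field cannot be explained
        per_field.append(candidates)
    sequences = []
    for combo in product(*per_field):
        # Assumption: Equal on a coarser field is redundant and is dropped.
        last = len(combo) - 1
        sequences.append(tuple(
            op for i, op in enumerate(combo) if op[0] != "Equal" or i == last))
    return sequences


base = {"year": 2021, "month": 5, "day": 17}
print(candidate_sequences(base, {"year": 2020}, {2020}))  # e.g. "in 2020"
print(candidate_sequences(base, dict(base), set()))       # e.g. "today"
```

For an expression such as "in 2020" written in 2021, the sketch yields both ToLast[year] and ModifyVal[2020, year]. This kind of ambiguity is why the captured sequences are treated as noisy candidates.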

Constructing and Filtering Rules
All the captured operation sequences are used for constructing candidate rules. We first find the values that appear in both the surface form and the operation sequence, then replace their appearances with the corresponding token types and variable symbols to construct candidate rules. For example, given the expression "this month" and the operation sequence Equal[Month], the replacement result will be "this TIME_UNIT:$1" and Equal[$1].
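The replacement step can be sketched as follows for the TIME_UNIT token type only. The `generalize` function, the small unit lexicon, and the (name, argument) encoding of operations are assumptions made for this sketch.

```python
# Hypothetical lexicon mapping surface tokens to time unit values.
TIME_UNITS = {"year": "Year", "month": "Month", "week": "Week", "day": "Day"}


def generalize(expression, operations):
    """Build a candidate rule from an expression and a captured sequence.

    A token is generalized to TIME_UNIT only if its unit value also
    appears as an argument in the operation sequence.
    """
    used_arguments = {argument for _, argument in operations}
    variables = {}  # unit value -> variable symbol such as "$1"
    pattern = []
    for token in expression.lower().split():
        unit = TIME_UNITS.get(token)
        if unit is not None and unit in used_arguments:
            variable = variables.setdefault(unit, f"${len(variables) + 1}")
            pattern.append(f"TIME_UNIT:{variable}")
        else:
            pattern.append(token)  # keep the token untyped
    generalized = [(name, variables.get(argument, argument))
                   for name, argument in operations]
    return " ".join(pattern), generalized


print(generalize("this month", [("Equal", "Month")]))
# ('this TIME_UNIT:$1', [('Equal', '$1')])
print(generalize("several day later", [("ApproxRef", "FUTURE_REF")]))
# ('several day later', [('ApproxRef', 'FUTURE_REF')])
```

The second call shows that "day" stays untyped because no operation uses a unit variable.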
The generation produces much noise, since there is often more than one sequence from one time value to another. We distinguish good rules by a simple intuition: more general patterns and more correct rules should appear on more expressions. We rank the candidate rules by their frequency and the frequency of their patterns on the training corpus, then select the most frequent rules for normalization. We suppose that there is no need to drop the low-frequency rules. The reason is that a low-frequency rule is either replaced by more generalized rules (e.g., the second rule in Example 3) or captures some meaningful token pattern that is difficult to generalize (e.g., "as soon as possible").
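A frequency-based ranking of candidate rules might look like the sketch below. Using rule frequency as the primary key and pattern frequency as the secondary key is an assumption, since the text does not specify how the two frequencies are combined; the candidate list is made-up toy data.

```python
from collections import Counter


def rank_rules(candidates):
    """Rank (pattern, operations) candidates, most frequent first.

    Primary key: frequency of the whole rule. Secondary key: frequency
    of its pattern. The rule itself is the final tie-breaker, so the
    order is deterministic.
    """
    rule_frequency = Counter(candidates)
    pattern_frequency = Counter(pattern for pattern, _ in candidates)
    return sorted(
        rule_frequency,
        key=lambda rule: (-rule_frequency[rule], -pattern_frequency[rule[0]], rule),
    )


# Toy candidates: one entry per (expression, captured sequence) pair.
candidates = [
    ("last TIME_UNIT:$1", ("ToLast[$1]",)),
    ("last TIME_UNIT:$1", ("ToLast[$1]",)),
    ("last TIME_UNIT:$1", ("Equal[$1]",)),  # noisy candidate
    ("as soon as possible", ("ApproxRef[FUTURE_REF]",)),
]
ranked = rank_rules(candidates)
for rule in ranked:
    print(rule)
```

The noisy candidate is ranked below the correct rule for the same pattern, while the rare but meaningful "as soon as possible" rule is kept rather than dropped.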

Applying Rules For Normalization
Given an input expression, ARTime tries to find a matching rule to normalize it. If the expression does not match any generated rule, ARTime attempts to search for a consecutive composition of rules and stop words that covers it. The stop words include connecting symbols (e.g., "-"), determiners (e.g., "this"), prepositions (e.g., "to") and so on.
The search is performed by a segmentation algorithm (i.e., the dynamic programming algorithm described in Algorithm 2). The algorithm tries to cover the input expression, except for the stop words in it (lines 6-7), with the minimum number of rules (lines 11-13). When there are multiple compositions of the same size, we simply choose the one that contains the most frequent rules. After that, we assume that all the operations in the chosen rules are useful and merge them into a new sequence according to the order described in Section 3.3.
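The segmentation can be sketched as a standard dynamic program over token positions. This is not Algorithm 2 itself: rule patterns are simplified to tuples of literal tokens, and the frequency-based tie-breaking between compositions of the same size is omitted. The function name and inputs are hypothetical.

```python
def segment(tokens, rule_patterns, stop_words):
    """Cover all tokens with the fewest rule patterns, skipping stop words.

    Returns the chosen patterns in order, or None if no cover exists.
    """
    n = len(tokens)
    infinity = float("inf")
    # best[i] = (number of rules, chosen patterns) for the prefix tokens[:i]
    best = [(infinity, None)] * (n + 1)
    best[0] = (0, [])
    for i in range(n):
        cost, chosen = best[i]
        if cost == infinity:
            continue  # the prefix tokens[:i] cannot be covered
        # A stop word can be skipped without using a rule.
        if tokens[i] in stop_words and cost < best[i + 1][0]:
            best[i + 1] = (cost, chosen)
        # Try every rule pattern that matches at position i.
        for pattern in rule_patterns:
            j = i + len(pattern)
            if j <= n and tuple(tokens[i:j]) == pattern and cost + 1 < best[j][0]:
                best[j] = (cost + 1, chosen + [pattern])
    return best[n][1]


rules = [("monday",), ("friday",), ("last",), ("year",), ("last", "year")]
stops = {"to", "-", "this"}
print(segment(["monday", "to", "friday"], rules, stops))
print(segment(["last", "year"], rules, stops))
print(segment(["monday", "or", "friday"], rules, stops))
```

The second call prefers the single two-token rule over two one-token rules, and the third call returns None because "or" is neither a stop word nor covered by a rule.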

Datasets
We use the TempEval-3 (UzZaman et al., 2013) benchmark and the Tweets benchmark proposed by Zhong et al. (2017). Benchmarks with many event-related time expressions, such as WikiWars (Mazur and Dale, 2010) and the tweets dataset of Tabassum et al. (2016), are not used in our evaluations, because understanding those expressions requires external knowledge of the events, which is not our focus. The statistics of the two benchmarks are given in Table 3.
TempEval-3 (UzZaman et al., 2013) is a sub-task in SemEval 2013 consisting of English news articles. We follow the previous study (Lee et al., 2014) and use TimeBank (Pustejovsky et al., 2003) and AQUAINT as its training datasets.
Tweets (Zhong et al., 2017) is a new benchmark consisting of English tweets. The annotators tend to annotate years at a finer granularity (e.g., the annotation "... in <T value=2014-XX-XX>2014</T>" means "a day in 2014"). These annotations are legal according to TimeML, but do not conform to the intuition of the expert-designed rules in existing methods. Thus we provide an alternative version, Tweets-M, by annotating the year expressions as is.

Compared Methods
We compare ARTime with 4 normalization systems: HeidelTime (Strötgen et al., 2013), UWTime (Lee et al., 2014), SUTime (Chang and Manning, 2012) and CogCompN (Ning et al., 2018). HeidelTime is the SOTA purely rule-based system. UWTime achieves the SOTA performance on TempEval-3. SUTime outperforms the other systems on social media texts according to Tabassum et al. (2016). CogCompN is the standalone normalizer of CogCompTime (Ning et al., 2018), which achieves SOTA results on TempEval-3. We also evaluate the performance of the compared normalization methods in real applications. We implement end-to-end systems with 3 SOTA recognition methods: SynTime (Zhong et al., 2017), TOMN (Zhong and Cambria, 2018), and PTime (Ding et al., 2019). We directly use the output of HeidelTime, SUTime, and UWTime for end-to-end comparison because they use the same rules (or grammar) for recognition and normalization.

Evaluation Metrics
We use the scripts provided by TempEval-3 for evaluation. For the normalization results, we report the accuracy of normalized temporal results with gold mentions. For the end-to-end results, we report the F1 score of normalized types, and the precision (Pr), recall (Re), and F1 score of normalized temporal values.
Table 4 reports the normalization results on gold recognition annotations. ARTime surpasses the other methods and shows better adaptability and robustness on Tweets (i.e., +3.8 points on the original Tweets). The performance of the compared methods varies dramatically across corpora. All compared methods except SUTime achieve very poor results on Tweets, while SUTime achieves the worst results on TempEval-3. ARTime does not perform very well on TempEval-3. The main reason is that the training data and the test data of TempEval-3 are annotated separately, and the insufficiency of training data severely hurts the performance of purely data-driven methods like ARTime, as observed in a previous study (Ding et al., 2019). For example, the test data of TempEval-3 includes 2 expressions about "flu season" (which should be normalized as winter); our method cannot handle them since none of the training expressions contains the word "season". Besides, the normalized values of some expressions rely on the tenses of the corresponding utterances and need to be re-computed by post-modification (Strötgen and Gertz, 2010; Lee et al., 2014) (e.g., "finished in June" denotes "June in last year" for base temporal values like "2021-05"). The above problems can be alleviated by introducing prior knowledge. We transform the expert rules in HeidelTime into ARTime's format as pre-defined rules and name the combined approach ARTime+H. ARTime+H achieves a good balance across domains, with the best results on Tweets-M and competitive results on TempEval-3 (i.e., 1.5 points lower than the SOTA results on values).
Table 5 reports the end-to-end results on TempEval-3 and Tweets-M. ARTime with the SOTA recognition method (PTime) outperforms the existing methods with an improvement of +2.2 points on the F1 score of normalized values on Tweets-M. The results of ARTime on TempEval-3 are weaker, but can be easily improved by introducing the same prior knowledge used in HeidelTime. ARTime+H with SynTime achieves the second-best F1 score on values without losing the advantage on Tweets (1.4 points higher than the best results achieved by the compared methods).
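The tense-based post-modification mentioned above for existing systems can be illustrated with a small sketch. It is a hypothetical heuristic written for this example, not the procedure of HeidelTime, UWTime, or ARTime: values are (year, month) pairs, and the tense label is assumed to be given.

```python
def adjust_for_tense(value, base, tense):
    """Shift a (year, month) value by one year if it contradicts the tense.

    A past-tense utterance should not point after the base value, and a
    future-tense utterance should not point before it.
    """
    year, month = value
    if tense == "past" and value > base:
        return (year - 1, month)
    if tense == "future" and value < base:
        return (year + 1, month)
    return value


# "finished in June" with base 2021-05: June 2021 lies in the future,
# so the past tense moves the value to June of the previous year.
print(adjust_for_tense((2021, 6), (2021, 5), "past"))  # (2020, 6)
```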

Analysis
We categorize the negative samples in the normalization results of ARTime by their causes in Table 6. About half of the negative samples are due to unseen patterns that cannot be captured by our rules. Another problem is the errors caused by tense in the context. Some existing systems apply post-modification tricks by comparing the tense with the sign of the difference between the output value and the base value. If our method could correctly utilize oracle tense information, the accuracy on TempEval-3 would increase to 79.7% (+4.3 points). There are also some cases in which the rules generated by our method do not fit the input expressions (the 3rd row in Table 6). We also manually analyzed the rules used in the test process to show to what extent the introduction of expert rules replaces the automatic generation in ARTime+H; the results are given in Table 7. The "Full" column reports the number of rules used in normalizing the expressions, and the "Auto" and "Ratio" columns report how many of those rules can be covered by automatic generation. The results show that automatic generation covers over 90 percent of the manual rules, and that adding about 2 rules is enough for ARTime.

Running Efficiency
All the results of ARTime are obtained by a single-threaded Scala implementation on a personal workstation with an Intel Xeon E5-1607 v4 CPU @ 3.10GHz and 128GB RAM. On average, ARTime generates ∼4.8 candidate rules for each expression. The offline training process took ∼16.3 minutes on TempEval-3 and ∼13.5 minutes on Tweets. The test process took ∼47 seconds on TempEval-3 and ∼46 seconds on Tweets.

Conclusion
In this paper, we mainly focus on automatically generating rules for time expression normalization. The main contributions of this paper are summarized as follows:
· We model time expression normalization as an operation sequence to construct the normalized temporal value, and ten basic operations are defined for time expression normalization.
· We present a novel method, called ARTime, for generating normalization rules from training data without expert interventions. Specifically, ARTime captures possible operation sequences from annotated data and generates candidate rules on time expressions with common surface forms, and finally obtains normalization rules by ranking the candidate rules.
· Our experimental results show that ARTime outperforms SOTA methods on the Tweets benchmark, and achieves competitive results with existing expert-engineered rule methods on the TempEval-3 benchmark. The end-to-end results when combining ARTime with time expression recognition systems are also very competitive.
There is still room to improve ARTime. One direction for future work is to generate more high-quality rules. Another is to enable ARTime to use the tense and event information in context.