Predicting Numerals in Text Using Nearest Neighbor Language Models



Introduction
Real-world objects and events have various quantitative properties, such as size, weight, length, and price. Commonsense about these quantitative properties is essential for a deep understanding of texts containing numerals and for reasoning at a level similar to or better than that of humans. Figure 1 shows examples of a masked numeral prediction (MNP) task requiring quantitative commonsense. The first example requires deriving a numeral referring to a height that is considered tall, based on commonsense about the distribution of human heights. The second example requires deriving the value of a typical movie length using commonsense and subtracting it from 12 p.m. Humans can easily choose numerals that approximately correspond to the ground-truth answers to these questions with considerable confidence. However, for models that lack such commonsense and computational skills, these inferences pose a challenge.
One of the main reasons why LMs fail to perform well on tasks that require quantitative commonsense is that they do not accurately learn the mapping between the strings of numerals and their magnitudes. Naive LMs treat numerals in text only as string tokens, that is, like any other words. While humans can associate the magnitude of a numeral with its string, LMs that treat numerals only as strings cannot accurately make such associations for arbitrary numerals (Wallace et al., 2019). This makes it difficult for naive LMs to understand the magnitudes of numerals and, consequently, to acquire quantitative commonsense.
To address this problem, previous studies have employed methods such as using word embeddings of numerals that reflect the magnitudes of the numerals (Wallace et al., 2019; Thawani et al., 2021), adding texts of arithmetic formulas to the training data (Geva et al., 2020), training LMs with a loss function that depends on the magnitudes of numerals (Sakamoto and Aizawa, 2021), and tokenizing numerals in a text into single digits to allow LMs to understand the concept of digits (Spithourakis and Riedel, 2018). However, these methods require fine-tuning or additional pretraining specific to the understanding of numerals. Therefore, in this study, we aim to improve the performance of LMs on a task that requires quantitative commonsense (specifically, the MNP task (Spithourakis and Riedel, 2018; Lin et al., 2020; Sundararaman et al., 2022)) without such numeral-specific additional training, by using the k-nearest neighbor LM (kNN-LM) (Khandelwal et al., 2020b), an LM extended with a retrieval-based method.
In addition, numerals have a higher rate of out-of-vocabulary (OOV) occurrences than regular words (Spithourakis et al., 2016a), making it difficult even for recent large neural LMs to accurately predict numerals in sentences from their context. In this study, based on the hypothesis that numerals appearing in similar contexts tend to be of the same type (e.g., dates, amounts of money, and numbers of people) and of similar magnitudes, we expect the kNN-LM model to improve the accuracy of the MNP task by reflecting numerals that appear in similar contexts in the prediction results. We also believe that an advantage of using the kNN search is not only the improvement in top-k accuracy but also the improvement in interpretability, as it provides contexts with similar uses of the predicted numeral.
In our experiments, we used the pre-trained BERT of HuggingFace (Wolf et al., 2020) as the base LM for kNN-LM. Two types of context ranges were used to compute the representation of the context of the masked numeral: the numeral mask and its surrounding words (Figure 3 (a)), and the numeral mask and its subsequent words (Figure 3 (b)). The context range in Figure 3 (b) is expected to improve search accuracy by focusing on the words that follow numerals, such as units, which strongly indicate the type of the preceding numeral.
For both ranges, the nearest neighbors were searched based on the context representation, computed as the average of the embedding vectors of all the words in the range. kNN-LM outperformed the base LM in MNP on most of the datasets we used. In addition, we confirmed that using only the mask token of the numeral was the most effective context range for the k-nearest neighbor (kNN) search in MNP.
To summarize, our contributions are as follows:
• We apply kNN-LM to the MNP task and show that the retrieval-based method can improve the performance of pre-trained LMs without additional training specific to numerals.
• We experiment with several different types of context ranges and find the optimal range for kNN search for the MNP task.
• We analyze the prediction accuracy for numerals included in the model vocabulary and those not included and confirm that kNN search significantly improves the performance for OOV numerals, which are difficult to predict with naive LMs.
2 Related Work

Masked Numeral Prediction
The MNP task can be used as a probing task to evaluate the quantitative commonsense acquired by LMs. Spithourakis and Riedel (2018) evaluated the numeracy of a long short-term memory model using the MNP task and concluded that current LMs have difficulty learning the mappings between the strings of numerals and their magnitudes. Therefore, to help models understand the magnitude of numerals, they proposed a method that predicts numerals as continuous Gaussian distributions and a method that uses character-level recurrent neural networks (Graves, 2013; Sutskever et al., 2011) for prediction, both of which improved prediction accuracy.
Lin et al. (2020) used the MNP task with uniquely determinable masked numerals, such as "A bird usually has [MASK] legs" or "A car usually has [MASK] wheels," and evaluated the quantitative commonsense acquired by BERT and RoBERTa. They showed that even pre-trained LMs that achieve performance comparable to humans on many NLP tasks perform significantly worse than humans on this task. In addition, although the pre-trained LMs seemed to make correct predictions, they often failed to maintain these predictions under small sentence changes that did not change the answer, such as when the target sentence was changed to "A car usually has [MASK] round wheels." This finding implies that achieving suitable robustness of model predictions is also a challenge.
In this study, considering these problems of current LMs, we do not revise the base model itself but instead reinforce its predictions using a retrieval-based approach, specifically, a kNN search based on the similarity of contexts.

Retrieval Augmented Methods in NLP
Retrieval-based approaches, which refer to the datastores as external knowledge, have been successful in many NLP tasks (Meng et al., 2021), such as named entity recognition (Wang et al., 2022), machine translation (Khandelwal et al., 2020a), and question answering (Guu et al., 2020).
kNN-LM is an LM whose predictions are augmented with the results of a kNN search for similar texts (Khandelwal et al., 2020b). The detailed design of the model is described in Section 4. When predicting a masked word in a sentence, kNN-LM searches the datastore for sentences similar to the context around the masked word. It aims to improve prediction accuracy by reflecting the retrieved nearest neighbors in the prediction score of the base LM. Since the kNN search is based on distances in the embedding space of the base LM, it has the advantages of not requiring additional training for the search and of being able to use any dataset as the datastore. Khandelwal et al. (2020b) report an improvement in perplexity from 18.65 to 15.79 on the WikiText-103 dataset (Merity et al., 2016). They also found that kNN-LM is particularly useful for predicting rare patterns, owing to the augmentation provided by the retrieval-based approach. Based on the hypothesis that it may also be effective in predicting numerals, where rare patterns occur frequently (Spithourakis et al., 2016b), we applied kNN-LM to the MNP task in this study.

Task Description
In this study, we used the MNP task to evaluate the numeracy of LMs. This task is defined as follows:

Input: A passage containing exactly one target numeral masked with the special token "[MASK]"

Output: A ranking of predicted numerals

There is exactly one masked numeral per passage, and the prediction model can see the other numerals in the same passage when making predictions for passages with more than one numeral. We initially considered masking multiple numerals in a passage; however, we decided to limit the number of masked numerals to one because masking multiple numerals would make the prediction difficult even for humans (e.g., "Restaurant reservations are preferred after [MASK] p.m. because the movie starts at [MASK] p.m.") and a single mask is more suitable for investigating whether LMs can capture the semantic relationships between numerals.
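The masking step can be illustrated with a small sketch. The function name, the regex, and the index argument below are ours, introduced only for illustration; the paper's exact preprocessing may differ.

```python
import re

def mask_one_numeral(passage, target_index=0):
    """Mask exactly one arithmetic-digit numeral in a passage with [MASK].

    `target_index` selects which numeral to mask; all other numerals in
    the passage remain visible to the prediction model, as in the task
    definition above. (Illustrative sketch, not the paper's exact code.)
    """
    matches = list(re.finditer(r"\d[\d,]*(?:\.\d+)?", passage))
    if target_index >= len(matches):
        raise ValueError("passage has no numeral at that index")
    m = matches[target_index]
    masked = passage[:m.start()] + "[MASK]" + passage[m.end():]
    return masked, m.group()

masked, answer = mask_one_numeral("The movie starts at 10 p.m. and runs 2 hours.")
# masked == "The movie starts at [MASK] p.m. and runs 2 hours.", answer == "10"
```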

Evaluation
The LMs (including kNN-LM) generate a probability distribution over the numeral tokens in their vocabulary using a softmax function. The top-k accuracy is a metric that evaluates the predicted ranking of the numeral tokens created from this probability distribution (Lin et al., 2020). It calculates the percentage of predictions for which the ground-truth numeral token is within the top k predicted tokens in the ranking.

The top-k accuracy simply evaluates whether the ground-truth numerals are included in the top k predictions. It does not consider how close the predicted numerals are to the corresponding ground truth. However, in the MNP task, a model that predicts numerals closer to the ground truth is generally considered to be a better model, even if its predictions are incorrect. Therefore, in this study, we used the top-k accuracy with a fixed percentage of numerical error allowed in each calculation, to evaluate the LMs in terms of the magnitude of the difference between the ground-truth numeral and the predicted numeral. In our experiments, we used k = 1, 3, 5, and 10 for evaluation.
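The error-tolerant metric described above can be sketched as follows. The function name and the exact definition of the tolerance (relative to the ground-truth value) are our assumptions; the paper may define the error percentage slightly differently.

```python
def topk_accuracy_with_error(predictions, truths, k, error_pct):
    """Top-k accuracy where a prediction counts as a hit if it is within
    `error_pct` percent of the ground-truth value.

    predictions: list of ranked numeric predictions, one list per example
    truths: list of ground-truth numerals
    (A sketch of the metric; tie-breaking and error handling simplified.)
    """
    hits = 0
    for ranked, truth in zip(predictions, truths):
        tol = abs(truth) * error_pct / 100.0
        if any(abs(p - truth) <= tol for p in ranked[:k]):
            hits += 1
    return hits / len(truths)

# With 10% allowed error, predicting 1900 for a ground truth of 2000 is a hit.
acc = topk_accuracy_with_error([[1900, 5, 7]], [2000], k=1, error_pct=10)
# acc == 1.0
```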
4 Nearest Neighbor Language Model

kNN-LM (Figure 2) predicts a masked token y in an input sentence using two different approaches, namely an LM and a kNN search (Khandelwal et al., 2020b). It then interpolates the two prediction scores with a mixture ratio λ to obtain the final prediction score p(y):

p(y) = λ p_kNN(y) + (1 − λ) p_LM(y)

where λ is a fixed parameter, p_kNN(y) is the prediction score of the kNN search, calculated by applying the softmax function to the negative distances between the test context and the top k most similar contexts in the datastore, and p_LM(y) is the prediction score reported by the LM.
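The interpolation rule can be sketched in a few lines. The function and variable names are ours; the neighbor list stands in for the datastore retrieval, and the per-token aggregation of the softmax over negative distances follows the description above.

```python
import math
from collections import defaultdict

def knn_lm_score(p_lm, neighbors, lam=0.2):
    """Interpolate LM probabilities with a kNN distribution:
        p(y) = lam * p_kNN(y) + (1 - lam) * p_LM(y)
    `p_lm` maps tokens to LM probabilities; `neighbors` is a list of
    (token, distance) pairs retrieved from the datastore. p_kNN is a
    softmax over negative distances, summed per token.
    (Illustrative sketch of the kNN-LM scoring rule.)
    """
    z = sum(math.exp(-d) for _, d in neighbors)
    p_knn = defaultdict(float)
    for tok, d in neighbors:
        p_knn[tok] += math.exp(-d) / z
    vocab = set(p_lm) | set(p_knn)
    return {y: lam * p_knn[y] + (1 - lam) * p_lm.get(y, 0.0) for y in vocab}

p = knn_lm_score({"7": 0.6, "8": 0.4}, [("8", 0.0), ("8", 1.0), ("7", 2.0)])
# "8" gains probability mass from the retrieved neighbors
```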
In the kNN search, two types of context ranges are used: the numeral mask and its surrounding n words (see Figure 3 (a)), and the numeral mask and its subsequent n words (see Figure 3 (b)). For both ranges, the average of the embedding vectors of all the words in the range is defined as the context representation of the masked numeral. While Khandelwal et al. (2020b) used only the words before the mask to calculate context representations, we used the aforementioned two context ranges for two reasons. First, our base LM is BERT, a bidirectional LM. Second, we hypothesized that words closely related to the magnitudes of numerals, such as units, tend to appear around the numerals, especially after them.
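The averaging over the two context ranges can be sketched as follows. The mode names and the simplified edge handling are our assumptions for illustration.

```python
import numpy as np

def context_representation(embeddings, mask_pos, n, mode="surrounding"):
    """Average the embedding vectors in a context window around the
    numeral mask, following the two ranges described above:
      "surrounding": the mask and the n words before and after it
      "subsequent":  the mask and the n words after it
    `embeddings` is a (seq_len, dim) array of token embeddings.
    (Illustrative sketch; sequence-edge handling is simplified.)
    """
    if mode == "surrounding":
        lo, hi = max(0, mask_pos - n), mask_pos + n + 1
    else:  # "subsequent"
        lo, hi = mask_pos, mask_pos + n + 1
    return embeddings[lo:hi].mean(axis=0)

emb = np.arange(20, dtype=float).reshape(5, 4)  # 5 tokens, embedding dim 4
vec = context_representation(emb, mask_pos=2, n=1, mode="surrounding")
# average of token rows 1..3
```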
Datasets

From each dataset, 70% of the passages were used as training data, 10% as validation data, and the remaining 20% as evaluation data. The main results on the FinNum and DROP datasets are shown in Appendix C, considering that their trends were generally the same as those of the other datasets.
The statistics of the passages and numerals contained in the aforementioned datasets are shown in Table 1.
Table 2 shows the statistics on OOV numerals across the four datasets. In particular, it presents the percentages of the three main categories of numerals that are not included in the BERT vocabulary, namely decimals, numerals with commas, and large numerals. The category "#large numerals" includes numerals larger than 6,000, the largest numeral in the BERT vocabulary. These categories intersect with each other. The trend of OOV numerals appearing in a dataset varies significantly depending on its domain and writing style. It can also be observed that decimals account for the majority of OOV numerals in all datasets.
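The three overlapping OOV categories can be made concrete with a short sketch. The function name and string-based checks are ours; the paper's exact categorization rules may handle edge cases differently.

```python
def oov_categories(numeral):
    """Classify a numeral string into the OOV categories described above:
    decimals, numerals with commas, and large numerals (> 6,000, the
    largest numeral in the BERT vocabulary). Categories may overlap.
    (Illustrative sketch, not the paper's exact preprocessing.)
    """
    cats = set()
    if "." in numeral:
        cats.add("decimal")
    if "," in numeral:
        cats.add("comma")
    try:
        if float(numeral.replace(",", "")) > 6000:
            cats.add("large")
    except ValueError:
        pass  # not a parseable numeral
    return cats

# "12,500.5" is a decimal, contains a comma, and exceeds 6,000
cats = oov_categories("12,500.5")
```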

Experimental Setup
In the experiments, we used the BERT model "bert-base-uncased" from HuggingFace Transformers (Wolf et al., 2020) as the base LM for kNN-LM. This base LM was used to make predictions from the context of masked numerals. The word embeddings for the kNN search were taken from the output of the second-to-last layer of this model. In this paper, a kNN-LM whose base LM is a BERT model fine-tuned on the MNP task is referred to as a kNN-LM fine-tuned on the MNP task.
BERT-DExp (Berg-Kirkpatrick and Spokoyny, 2020) and NumGPT (Jin et al., 2021) are powerful baselines for predicting numerals from context. These methods reflect the magnitudes of numerals in the numeral embeddings and have improved the ability to roughly predict numerals (i.e., the rate of agreement on the number of digits). However, we did not adopt these models as the base LM for kNN-LM in this study because we believe that methods that reflect the magnitudes of numerals in the numeral embeddings can have a negative impact on the accurate prediction of numerals, thereby losing the advantages of retrieval-based approaches, which are beneficial in terms of accuracy.
Infrequent numerals, decimals, and numerals with commas are not included in the naive BERT vocabulary; thus, such numerals in the datasets are split into multiple numeral tokens by the BERT tokenizer in the preprocessing stage (Tables 1 and 2 show the percentages and statistics of OOV numerals). However, in the test set, to prevent partial masking, the numerals are masked with a single token (i.e., without splitting them first). Consequently, it may be impossible for naive LMs to predict the masks of OOV numerals with a zero error rate. However, the frequency of numeral tokens in the BERT vocabulary ensures predictions with an error of less than 10% (except for large numerals; see Table 2). Since OOV numerals rarely appear in the datastore for the kNN search, a zero error rate would hardly be possible regardless of the single-token masking. Therefore, we believe that our methods can be fairly compared to the others even with this masking strategy.
For the kNN search, we set k = 50 and used the L2 norm to calculate the distance between context vectors. The mixing ratio of the kNN search results and the LM prediction scores was set to λ = 0.2 based on the results of our preliminary experiments. The experimental results are given as the average scores of two or more runs. Other experimental settings are described in Appendix A.

Methods for Representing the Context for kNN Search

The results of the kNN search using different context ranges are shown in Table 3. The kNN search using only the embedding vector of the mask token for the masked numeral achieved the highest accuracy in the MNP task for both context ranges. We suggest that this may be because the embedded representations of the mask tokens of numerals contain sufficient information to predict the masked numerals near the last layer of the fine-tuned LM. The results of the experiment comparing the two context ranges on the ACLsent dataset are shown in Appendix B. Initially, we expected the context range after the mask to be more effective than the range before and after the mask because it can exploit units that often follow numerals. However, we did not observe a significant difference between them. In the following experiments, the results of the kNN search were obtained using only the embeddings of the mask tokens, which exhibited the best accuracy in this experiment.
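The retrieval step with these settings (k = 50, L2 distance) can be sketched as a brute-force search over the datastore keys. In practice a similarity-search library such as Faiss would be used at scale; this numpy version is only an illustration, and the names are ours.

```python
import numpy as np

def knn_search(datastore_keys, query, k=50):
    """Retrieve the k nearest datastore contexts by L2 distance.

    `datastore_keys` is an (N, dim) array of stored context
    representations; `query` is a (dim,) context vector for the test
    example. Returns the indices and distances of the nearest entries.
    (Brute-force sketch; an approximate index would be used at scale.)
    """
    dists = np.linalg.norm(datastore_keys - query, axis=1)
    idx = np.argsort(dists)[:k]
    return idx, dists[idx]

keys = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
idx, d = knn_search(keys, np.array([1.0, 0.0]), k=2)
# nearest is the second entry (distance 0), then the first (distance 1)
```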

Masked Numeral Prediction
Tables 4 and 5 show the top-k accuracy of kNN-LM on the Numeracy-600K and ACLsent datasets for the MNP task (without and with fine-tuning on the task). The results for the FinNum and DROP datasets are shown in Appendix C.
Comparing the prediction accuracy before and after fine-tuning, we observed that fine-tuning the base LM on the MNP task improved the prediction accuracy of both the kNN search and kNN-LM on both datasets. This confirms the effectiveness of fine-tuning the base LM for the kNN search and kNN-LM. In both cases, before fine-tuning, the kNN search outperformed the LM in terms of accuracy. In particular, on the Numeracy-600K dataset, the largest dataset used in our experiments, the kNN search significantly outperformed the LM both before and after fine-tuning. However, on the ACLsent dataset, the smallest dataset used in our experiments, the performance difference was not as pronounced, indicating that dataset size can influence the extent of the improvement. These findings demonstrate that, unlike the LM, the kNN search can achieve moderate prediction accuracy without additional fine-tuning, given a sufficiently large datastore. When comparing the prediction accuracy of each method after fine-tuning, we found that on both datasets, the prediction accuracy of the kNN search alone or kNN-LM exceeded that of the base LM alone by approximately 2% to 5% in all settings. This confirms the effectiveness of the kNN search in the MNP task. In particular, the kNN search demonstrated superior accuracy from top-3 onward and when a numerical error margin of 10% or more was allowed, suggesting that it can retrieve a more diverse set of numerals as predictions than the LM.
Table 6 lists the top-k accuracy of the fine-tuned kNN-LM for numerals included in and excluded from the BERT vocabulary in the Numeracy-600K dataset. The results for the numerals included in the vocabulary show almost the same trend as the overall results (Table 4). By contrast, the kNN search significantly outperformed the LM in predicting the OOV numerals. Although it is challenging to compare their performance accurately, owing to the vocabulary and datastore limitations affecting the LM and the kNN search, respectively, we believe that in settings allowing a small margin of numerical error, their performance can be considered fairly comparable. The results for OOV numerals in the ACLsent dataset are shown in Appendix D.

Output Examples of kNN Search for OOV Numerals
Table 7 shows the top-5 output examples of the kNN search for masks of OOV numerals in the Numeracy-600K dataset. An LM fine-tuned on this dataset with the MNP task was used for the kNN search. In each sentence, one numeral is shown in bold, indicating that the kNN search was performed with that numeral masked.
Because the masked OOV numerals occur infrequently, it is difficult to find contexts in the datastore in which the same numerals appear. Nevertheless, this result shows that the kNN search could find contexts remarkably close to the test context, although the exact-match accuracy of the numerals was not high. In the first example, the kNN search found a context for an earthquake of similar magnitude, and the test phrase "no injuries or tsunami reported" and the first-predicted context "Tsunami warning not expected" are extremely close. The second example is considered one of the most difficult for the kNN search because the answer "2065" does not appear in the datastore. However, despite the short context, it correctly estimated that the masked numeral is a future year and succeeded in finding a considerably close numeral in the datastore, albeit as the third prediction.
However, the results also reveal a limitation of the kNN search. In the third test sentence, the numeral with "times" as the unit is masked. Although the kNN search output contexts containing numerals with "times" as the unit in all of the top-5 cases, it failed to find contexts containing numerals close to the answer. This may be because a deeper understanding of the contexts is required for masked numerals with units such as "times," which allow for a wider range of preceding numerals. Similarly, there were cases where the kNN search was not very effective in predicting amounts of money following "$", which likewise allow for a wide range of numerals.
While kNN search achieved successful predictions in some cases and faced challenges in others, as shown in the table, humans can easily understand the rationale behind the predictions (e.g., same units or similar contexts).This improved interpretability stands as a significant advantage of kNN search over LMs.

kNN Search in Cross-Domain Setting
Table 8 shows the results of the kNN search with datastores from different domains. We performed the kNN search with Numeracy-600K, ACLsent, and DROP as datastores for the Numeracy-600K and ACLsent datasets. The results show that the accuracy of the kNN search in the cross-domain setting was significantly lower than that achieved using the same-domain dataset as the datastore. This indicates that the types and properties of the numerals in these three datasets differ greatly, and in many cases, similar contexts and numerals were not found by the kNN search. These results suggest that the kNN search can achieve the best performance only with a datastore containing a larger number of more diverse sentences than those used in this study. Future work should focus on experiments and analysis of the kNN search using a large-scale datastore.

Conclusion
In this study, we applied kNN-LM to the MNP task and quantitatively evaluated its prediction accuracy. The results show that utilizing the kNN search for numeral prediction reduced the absolute numerical errors compared with existing methods. In particular, the prediction accuracy greatly improved for numerals not included in the model vocabulary, which are difficult to predict with naive LMs. We also experimented with two different context ranges and confirmed that the most effective method for the kNN search is the one that uses only the word embedding of the mask token for the masked numeral as the context representation.

Limitations
One limitation of our study is that the performance of the kNN search is highly dependent on the domain of the datastore used. As shown in Section 6.4, the kNN search, like a standard LM, does not work well for contexts and numerals from out-of-domain data. This dependence can be reduced by increasing the size of the datastore and introducing passages from various domains; however, this strategy may exacerbate another limitation, as discussed hereafter.
The second limitation is that kNN-LM requires more memory for the datastore and incurs higher search latency during inference compared with standard LMs. Although the search itself can be executed swiftly by leveraging efficient similarity-search libraries such as Faiss (Johnson et al., 2017), as the size of the datastore expands, the time required to obtain the representation vectors is expected to increase.
The third limitation pertains to the lack of language variety in the datasets used. While we deliberately selected datasets from different domains for our experiments, they share a common language, namely English. We expect kNN-LM to exhibit similar effectiveness in languages with linguistic structures similar to English; however, experiments on non-English datasets are necessary to provide evidence for a language-independent impact of kNN-LM. This aspect will be addressed in future research.

A Experimental Setup
We used the Adam optimizer with the learning rate and max-grad-norm set to 5 × 10^-5 and 1.0, respectively. All the words in the passages were tokenized with the BERT tokenizer; the passages were then truncated to sequences of 512 tokens or fewer. In this study, only numerals expressed in arithmetic digits, such as "1" and "2022," were treated as target numerals to be predicted; numerals expressed in English words, such as "one" and "ten," were not included.
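The distinction between digit numerals and spelled-out numerals can be sketched with a simple extraction function. The regex is our assumption for illustration; the paper does not specify its exact extraction rules.

```python
import re

def target_numerals(text):
    """Extract only numerals written in arithmetic digits ("1", "2022"),
    skipping numerals spelled out as English words ("one", "ten"),
    following the target definition above.
    (Illustrative regex, not the paper's exact extraction code.)
    """
    return re.findall(r"\d[\d,]*(?:\.\d+)?", text)

nums = target_numerals("One car has 4 wheels and costs 12,500.5 dollars.")
# ["4", "12,500.5"] -- the spelled-out "One" is not extracted
```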

B Results of Context Range Comparison on ACLsent
The results of using different context ranges for the kNN search on the ACLsent dataset are shown in Table 9. The same trends observed in the Numeracy-600K dataset were confirmed (Table 3).

C Results of MNP Task on Other Datasets
The prediction results for kNN-LM on the FinNum and DROP datasets are shown in Tables 10 and 11. On the FinNum dataset, the kNN search exhibited better accuracy than the LM without fine-tuning, and the kNN search alone or kNN-LM achieved the best accuracy in most settings with fine-tuning. This is the same trend as that observed on the Numeracy-600K and ACLsent datasets (Tables 4 and 5). By contrast, the results show a different trend for the DROP dataset. With fine-tuning, the kNN search alone or kNN-LM achieved the best accuracy in most settings, but without fine-tuning, the LM significantly outperformed the kNN search. This may be because the DROP dataset differs from the other datasets in that each passage is longer (Table 1). When a passage is long, the model can check numerals other than the masked one in the same passage, and if the answer or a near-answer numeral is among them, the task can be solved as a simple reading comprehension task, at which LMs excel, without the kNN search.

D Results of MNP Task for OOV Numerals in ACLsent
Table 12 shows the top-k accuracy of the fine-tuned kNN-LM for numerals included in and excluded from the BERT vocabulary in the ACLsent dataset. The same trends observed in the Numeracy-600K dataset were confirmed (Table 6).

Figure 1: Examples of the masked numeral prediction task requiring quantitative commonsense.

Figure 2: Overview of kNN-LM for the MNP task.

Figure 3: Two context ranges of a masked numeral for kNN search.


Table 1: Statistics across four different datasets (training set).

Table 2: Statistics on OOV numerals across four different datasets (training set).

Table 3: Top-k accuracy of kNN search on the Numeracy-600K dataset when two different context ranges are used to compute the contextual representation: one with the mask and the n words before and after it (Figure 3 (a)), and one with the mask and its subsequent n words (Figure 3 (b)). "% of NE" indicates the percentage of numerical error allowed in each top-k accuracy calculation.
Numeracy-600K, ACLsent, and FinNum have only a few sentences per passage compared with DROP, which has longer passages. ACLsent and DROP contain 5-15 numerals per passage, while Numeracy-600K and FinNum contain fewer than 5 numerals per passage. The types of numerals appearing in the passages also differ depending on the dataset domain. Numeracy-600K and DROP contain more four-digit numerals, such as year numbers, than the other datasets. Partly for this reason, they also have a relatively lower percentage of decimals and OOV numerals, which are not included in the BERT vocabulary. ACLsent and FinNum contain many decimals and infrequent numerals, such as numerals from experimental results and statistics, monetary values, and percentage changes in stock prices, as reported by the statistics in Table 2.

Table 4: Top-k accuracy of kNN-LM on the Numeracy-600K dataset. "kNN," "LM," and "kNN+LM" indicate the accuracy of kNN search alone, the accuracy of the base LM, and the accuracy of the entire kNN-LM, respectively. "% of NE" indicates the percentage of numerical error allowed in each top-k accuracy calculation.

Table 5: Top-k accuracy of kNN-LM on the ACLsent dataset.

Table 6: Top-k accuracy of the fine-tuned kNN-LM for numerals included in and excluded from the vocabulary in the Numeracy-600K dataset.

Table 7: Top-5 output examples of kNN search for masks of OOV numerals in the Numeracy-600K dataset.

Table 8: Top-k accuracy with 0% error of kNN search in cross-domain settings.

Table 9: Top-k accuracy of kNN search on the ACLsent dataset when two different context ranges are used to compute the contextual representation: one with the mask and the n words before and after it (Figure 3 (a)), and one with the mask and its subsequent n words (Figure 3 (b)). "% of NE" indicates the percentage of numerical error allowed in each top-k accuracy calculation.

Table 10: Top-k accuracy of kNN-LM on the FinNum dataset.

Table 11: Top-k accuracy of kNN-LM on the DROP dataset.

Table 12: Top-k accuracy of the fine-tuned kNN-LM for numerals included in and excluded from the vocabulary in the ACLsent dataset.