Do Language Models Understand Measurements?

The recent success of pre-trained language models (PLMs) has stimulated interest in their ability to understand and work with numbers. Yet, numerical reasoning over measurements has not been formally studied despite its importance. In this study, we show that PLMs lack the capability required for reasoning over measurements. Furthermore, we find that a language model trained on a measurement-rich corpus performs better at understanding measurements. We propose a simple embedding strategy to better distinguish between numbers and units, which leads to a significant improvement in the probing tasks.


Introduction
The success of pre-trained language models (PLMs) has led to more research on their ability to understand commonsense knowledge. In this context, numerical reasoning over text (NRoT) is an NLP model's ability to interpret and work with numbers in either digit or word form (Spithourakis and Riedel, 2018). Recent studies on NRoT test PLMs on numeracy questions (Wallace et al., 2019), scalar magnitude comparison (Zhang et al., 2020), numerical facts (Lin et al., 2020), and math word problems (Wu et al., 2021).
Despite these efforts, existing work lacks an analysis of the forms in which numbers appear. In particular, we focus on the case where numbers appear as measurements in the context. In most scientific articles, measurements are an integral part of the context and are needed to capture its intended meaning. For example, the two sentences "40g of Aspirin is lethal" and "40mg of Aspirin is lethal" contain the same words except for the unit of measurement (UoM), but the second sentence is incorrect because of the UoM.
In this work, we examine the measuring skill of PLMs: the ability to understand the system of measurement and perform numerical reasoning over measurements. We design three measuring skill tests (MSTs) and study the extent to which measuring skills have been acquired. Specifically, UNIT CONVERSION, REFERENCE RANGE DETECTION, and MEASUREMENT COMPARISON require an understanding of the system of measurement, knowledge of the normal range of a biomedical entity, and the ability to combine knowledge about the system of measurement with NRoT, respectively. Table 1 shows an example of each measuring skill test.
MST results showed that the models struggled to find the largest (or smallest) value in a list of measurements and to convert a measurement to another unit, while they performed well on the other tests. Compared to other PLMs, BioBERT (Lee et al., 2020) showed superior performance on UNIT CONVERSION and REFERENCE RANGE DETECTION, which implies that pre-training on measurement-rich text helps the model understand the system of measurement. Finally, we speculate that the models fail on some MSTs because they lack the skill to distinguish numbers, units, and other words in the context. To mitigate this, we introduce scale embedding, which provides the model with information about the position and scale of the numbers in the input text. We show that scale embedding significantly improves the MST performance of all PLMs.

Measuring Skill Test
In this section, we describe the three MSTs designed to carefully study the ability of PLMs to understand the system of measurement and perform numerical reasoning over measurements.

Unit Conversion
This task requires the model to decide whether two measurements represent the same quantity. For example, the model should predict [MASK] in a sentence such as "3.5g and 3500mg are [MASK] value" to be filled with "same" if it understands the conversion of units correctly. By convention, a unit (e.g., liter, meter) is combined with a prefix (e.g., kilo, milli) so that the numerical value of a measurement falls within the range [10^-3, 10^3). Therefore, various unit prefixes can appear in a single passage even when the underlying units are the same. Handling this makes UNIT CONVERSION essential for complex reasoning over measurements. To succeed at UNIT CONVERSION, we expect the model to handle the unit and the numerical value jointly, based on an understanding of the system of measurement.

Table 2: Templates used for data generation. [M], [LoM], and [ENT] are placeholders for a measurement, a list of measurements, and a biomedical entity, respectively.
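To make the task concrete, the equal-quantity decision underlying UNIT CONVERSION can be sketched as a prefix-normalization check. This is an illustration, not the paper's code; the prefix table is abridged and the float-comparison tolerance is an assumption:

```python
# Decide whether two measurements denote the same quantity by
# normalizing SI prefixes (illustrative sketch; abridged prefix table).
PREFIX = {"k": 1e3, "": 1.0, "m": 1e-3, "u": 1e-6}

def to_base(value, prefix):
    """Convert a (value, prefix) pair to its prefix-free base value."""
    return value * PREFIX[prefix]

def same_quantity(v1, p1, v2, p2, rel_tol=1e-9):
    """True if the two measurements represent the same quantity.
    The relative tolerance is an assumption for float comparison."""
    a, b = to_base(v1, p1), to_base(v2, p2)
    return abs(a - b) <= rel_tol * max(abs(a), abs(b), 1.0)

# "3.5g and 3500mg are [MASK] value" -> "same"
label = "same" if same_quantity(3.5, "", 3500.0, "m") else "different"
```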

Reference Range Detection
Given a biomedical entity and a measurement, this task requires a model to predict whether the measurement falls within the reference range. Knowledge of the biomedical entity plays a crucial role in understanding measurements, since the unit is determined by the biomedical entity. For example, we measure the hemoglobin level in g/dL. In addition to understanding UoMs, PLMs must rely on domain knowledge embedded in their parameters to solve this task, as the context alone does not provide sufficient clues as to what the reference range is for the given biomedical entity.
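The gold label for this task reduces to a range check, which can be sketched as below; the hemoglobin range is an illustrative placeholder, not taken from the paper's MIMIC-III-derived data:

```python
# Sketch of the REFERENCE RANGE DETECTION label: the answer is whether
# a measurement falls inside the entity's reference range.
# The range below is an illustrative placeholder, not the paper's data.
REFERENCE_RANGES = {
    "hemoglobin": (12.0, 17.5),  # g/dL (assumed range)
}

def in_reference_range(entity, value):
    """Return True if `value` lies within the entity's reference range."""
    low, high = REFERENCE_RANGES[entity]
    return low <= value <= high
```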

Measurement Comparison
Given two measurements (or a series of n measurements), the task is to predict the correct relationship between them. We created the synthetic dataset following other well-known NRoT tasks. Here, we consider three numerical reasoning tasks: COMPARISON (Talmor et al., 2020), ARGMIN/MAX (Wallace et al., 2019), and SORTING (Pal and Baral, 2021), all of which require the model to compare numbers. Note that each measurement in this task can have a different unit prefix. For example, the sample "1.59mg is [MASK] than 3.8g", containing the two different units "mg" and "g", appears in the COMPARISON dataset. This task assesses the model's ability to combine an understanding of measurements with numerical reasoning skills.
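As an illustration of how such a sample could be generated (a sketch under assumptions: uniform number sampling, an abridged prefix set, and the template wording taken from the example above):

```python
import random

PREFIX = {"": 1.0, "m": 1e-3, "k": 1e3}  # abridged; assumed prefix set

def sample_comparison(unit="g", rng=random):
    """Generate one COMPARISON example: two measurements of the same
    unit, possibly with different prefixes, plus the gold label.
    Numbers are drawn from [1e-2, 1e2) as in the training data;
    uniform sampling is an assumption."""
    def draw():
        return round(rng.uniform(1e-2, 1e2), 2), rng.choice(list(PREFIX))
    (v1, p1), (v2, p2) = draw(), draw()
    # Ties are broken toward "smaller"; they are rare with real-valued draws.
    label = "larger" if v1 * PREFIX[p1] > v2 * PREFIX[p2] else "smaller"
    text = f"{v1}{p1}{unit} is [MASK] than {v2}{p2}{unit}"
    return text, label
```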

Experiments
Probing Setup We formulated MSTs as a Cloze test (Talmor et al., 2020) to fully utilize the knowledge captured by masked language modeling (MLM). Specifically, a PLM received the masked inputs given in Table 1, and the MLM head output the probability distribution over the answer candidates for [MASK]. Among the answer candidates, we chose the one with the highest probability as the final prediction.
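The final-prediction step can be sketched as follows, where `mask_logits` stands in for the MLM-head scores at the [MASK] position (the dictionary interface is a simplification of ours; in practice the scores come from the probed PLM):

```python
def cloze_predict(mask_logits, candidates):
    """Choose the answer candidate with the highest MLM-head score at
    the [MASK] position. Softmax is monotone in the logits, so the
    argmax over raw logits gives the same prediction as the argmax
    over probabilities."""
    return max(candidates, key=lambda tok: mask_logits[tok])

# Toy scores: "the" scores highest overall but is not a candidate,
# so the prediction is restricted to the candidate set.
pred = cloze_predict({"same": 4.2, "different": 1.3, "the": 9.0},
                     candidates=["same", "different"])
```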
We probed four transformer-based PLMs. BERT (Devlin et al., 2019) and ALBERT (Lan et al., 2020) were trained on Wikipedia articles and the Book Corpus. BioBERT (Lee et al., 2020) was trained on biomedical articles from PubMed abstracts, and BlueBERT (Peng et al., 2020) used both clinical (MIMIC-III (Johnson et al., 2016)) and biomedical (PubMed abstracts) corpora for pre-training. We also tested a randomly initialized transformer encoder (i.e., Scratch) to evaluate the difficulty of our MSTs. For each model, we did not update the parameters during training, except for the MLM head in the last transformer layer. In all tasks, the models were trained with three random seeds, and we report the mean classification accuracy for all probing tasks. Appendix A provides further details on training and evaluation.

Data Preparation We manually crafted the templates in Table 2, each of which contains at most two slots for measurements and a [MASK] token for the answer. We instantiated [M] and [LoM] by sampling a measurement and a list of measurements, respectively. For measurement sampling, we independently sampled a number and a unit and then combined them. Specifically, we sampled units from the predefined set in Table 7, which consists of SI units and some units found in MIMIC-III.
The numbers in the training dataset were sampled from [10^-2, 10^2). For evaluation, we constructed two evaluation datasets: 1) Interpolation samples numbers from the same range as the training dataset; 2) Extrapolation samples numbers from [10^-3, 10^3). Note that we did not consider numbers outside the range [10^-3, 10^3), because most unit prefixes come in powers of a thousand. Zhang et al. (2020) reported that representing numbers in scientific notation makes it easier for a language model to capture the scale of numbers. Following this observation, we tested two different number notations: decimal and scientific. For example, 32.6 can be represented as 32.6 in decimal notation and 3.26E+01 in scientific notation. We randomly varied the number of digits after the decimal point between zero and three, and the significant digits were maintained after converting the number notation.
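The notation change can be sketched as below; inferring the significant-digit count from the input string is an assumption about how the digits are "maintained":

```python
def to_scientific(num_str):
    """Convert a decimal-notation number string to scientific notation
    while preserving its significant digits, e.g. "32.6" -> "3.26E+01".
    The significant-digit count is inferred from the input string."""
    digits = num_str.replace(".", "").lstrip("0")
    sig = max(len(digits), 1)
    return f"{float(num_str):.{sig - 1}E}"

# to_scientific("32.6")   -> "3.26E+01"
# to_scientific("0.0025") -> "2.5E-03"
```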
For REFERENCE RANGE DETECTION, we collected biomedical entities from six tables in MIMIC-III (INPUT, OUTPUT, LAB, PRESCRIPTION, PROCEDURE, and CHART) and chose a subset.
We report the number of samples and the distribution of labels for each MST in Table 8.

Results and Analysis
Measuring Skills of PLMs Table 3 shows the results of MSTs stated in Section 2.
PLMs performed reasonably well on COMPARISON, SORTING, and REFERENCE RANGE DETECTION, but struggled considerably on the ARGMIN/MAX and UNIT CONVERSION tasks. This shows that some measuring skills are difficult to learn from an LM objective. Similar to previous NRoT studies (Wallace et al., 2019; Pal and Baral, 2021), PLMs often failed to extrapolate to values outside the training range. Further, in most cases, MST results got worse when we represented numbers in scientific notation.
We observed that BioBERT outperformed the other PLMs on UNIT CONVERSION, REFERENCE RANGE DETECTION, and COMPARISON, and showed comparable performance on the rest of the MSTs. Compared to BioBERT, BlueBERT was pre-trained on a larger volume of biomedical text but showed worse performance. This suggests that pre-training on measurement-rich corpora helps the model acquire measuring skills, but that further training on noisy clinical text can harm reasoning over measurements. We also found that ALBERT outperformed its competitors on SORTING even though it performed the same or worse on the other tasks. This may be because ALBERT benefits from its sentence order prediction (SOP) objective, which predicts the ordering of two consecutive segments of text.
Effect of Using Different Prompts One can expect the choice of prompt to have an impact on the results, and recent studies (Jiang et al., 2020; Petroni et al., 2019) support this. To see whether the results in Table 3 are maintained as the prompt differs, we trained and evaluated PLMs on three distinct sets of prompts: CONTEXT, UOM, and LABEL. Specifically, CONTEXT, UOM, and LABEL examine how consistent MST results are against various linguistic expressions of the prompts, the set of unique UoMs in the dataset, and the choice of answer candidates, respectively. Note that we considered the answer candidates to be part of the prompt, since the prompt determines the set of correct answers.
For CONTEXT, we manually created four additional templates that have the same meaning as the original template in Table 2. For UOM, we used only a subset of units, g, l, m, and s, which appear frequently in general text. For LABEL, we included synonyms of the label as answer candidates. For example, "less", "smaller", and "lower" are the answers for the prompt "1.59mg is [MASK] than 3.8mg.". More details of the experiments are in Appendix B.
The results with decimal notation are shown in Table 4. We can see that the results vary with the choice of prompt, indicating that PLMs are indeed sensitive to it. However, we found that MST performance maintains a similar tendency in every experiment: BioBERT works well on COMPARISON, UNIT CONVERSION, and REFERENCE RANGE DETECTION, and ALBERT works well on SORTING.
Rule-based Conversion of Measurements Measurements exhibit a certain pattern, regardless of the domain, because of a global standard: the International System of Units (SI). Thus, we can manually detect and convert all units in the text without difficulty. It is then natural to wonder whether converting all units based on rules is easier than making the language model understand the system of measurement. To answer this question, we tested a rule-based conversion that detects measurements with a regular expression and converts them into a prefix-free form. For example, the sentence "2.5mg is [MASK] than 3.8g" is converted to "0.0025g is [MASK] than 3.8g" after the rule-based conversion. We examined the rule-based conversion on MEASUREMENT COMPARISON and REFERENCE RANGE DETECTION.
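A minimal sketch of such a conversion follows. This is an illustration only: the prefix set and base units are abridged, and the regular expression is an assumption rather than the paper's exact rule:

```python
import re

PREFIX = {"k": 1e3, "m": 1e-3, "": 1.0}  # abridged; assumed prefix set
BASE_UNITS = "g|l|m|s"  # abridged; regex backtracking resolves "mg" vs. bare "m"

def to_prefix_free(text):
    """Detect measurements such as '2.5mg' and rewrite them in
    prefix-free form, e.g. '2.5mg' -> '0.0025g'."""
    pattern = re.compile(rf"(\d+(?:\.\d+)?)(k|m)?({BASE_UNITS})\b")

    def repl(match):
        value, prefix, unit = match.groups()
        base = float(value) * PREFIX[prefix or ""]
        return f"{base:g}{unit}"

    return pattern.sub(repl, text)
```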
The results with decimal notation are shown in Table 5. The rule-based conversion increased MEASUREMENT COMPARISON performance because the converted MEASUREMENT COMPARISON does not require an understanding of unit conversion to solve the problem. However, almost all models became worse on REFERENCE RANGE DETECTION. This shows that knowledge about the reference range is highly correlated with the specific UoM. Thus, the rule-based conversion is a suboptimal choice if we want to utilize the domain knowledge embedded in PLMs.

Table 6: Effect of scale embedding on MSTs. We report the classification accuracy and the performance improvement (∆) after applying scale embedding.
Scale Embedding and its Effect In Section 4, we observed that none of the PLMs showed a perfect understanding of each MST. We suspect that this gap originates in the deficiency of the PLMs' ability to extract numerical values from measurements and compare their magnitudes. To address this, we propose scale embedding, an additional embedding that provides the model with information about the position and scale of the numbers in the input text. As described in Figure 1, we incrementally assign an index to each token from the end to the beginning of a sentence. If we encounter a token that is not part of a numerical value, we reset the index to zero and keep assigning index zero to tokens until another numerical value appears. We distinguished between numerical and non-numerical subwords using a regular expression. Note that we trained only the scale embedding and the MLM head while freezing the other pre-trained weights of the language model. This allows us to adapt the model to any numerical reasoning task simply by plugging a different scale embedding into it.
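The index-assignment scheme can be sketched as follows (our reading of Figure 1; the numeric-subword test is an assumed regular expression, and the tokenization in the example is illustrative):

```python
import re

# Assumed test for numeric subwords; the paper's exact regex is not given.
_NUMERIC = re.compile(r"^[0-9.]+$")

def scale_indices(tokens):
    """Assign scale-embedding indices, walking the token sequence from
    the end to the beginning: numeric subwords receive incrementing
    indices, and any non-numeric token resets the counter to zero."""
    indices, counter = [], 0
    for tok in reversed(tokens):
        if _NUMERIC.match(tok):
            counter += 1
            indices.append(counter)
        else:
            counter = 0
            indices.append(0)
    return indices[::-1]

# Illustrative subword tokenization of "2.5mg is larger":
# ["2", ".", "5", "mg", "is", "larger"] -> [3, 2, 1, 0, 0, 0]
```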
Table 6 shows the MST results after scale embedding is applied to all models, where we can see significantly improved test results, even for ARGMIN/MAX and UNIT CONVERSION.¹ Note that scale embedding is minimally effective for Scratch, except on COMPARISON. This shows that solving our MSTs requires more than simple embeddings, and that a PLM that understands context is an essential element.

¹ The full set of experimental results is shown in Table 13 in the Appendix.

Related Work
Over the years, numerical reasoning has been an active research area. Some works investigate the numeracy of static word embeddings (Naik et al., 2019), contextualized language embeddings (Wallace et al., 2019), and multilingual embeddings (Johnson et al., 2020). Wallace et al. (2019) show that ELMo, BERT, and GloVe embeddings are capable of capturing numeracy, but only within the range of numbers seen during training. GenBERT (Geva et al., 2020), NumGPT (Jin et al., 2021), and NT5 (Yang et al., 2021) focus on incorporating arithmetic skills into pre-trained models. Another task that deals with numerical quantities is measurement estimation. VerbPhysics (Forbes and Choi, 2017) proposes a dataset for comparing the relative scales of the physical attributes of various objects. DoQ (Elazar et al., 2019) provides an empirical distribution over possible values of quantitative attributes. Zhang et al. (2020) test whether NLP models contain information about the scalar magnitudes of physical objects. Although previous studies probed numerical reasoning over numerals and physical attributes, no attempt has been made to investigate reasoning over measurements.

Conclusion
To the best of our knowledge, our study is the first to investigate reasoning over measurements. Our analysis shows that PLMs lack the capability required for reasoning over measurements. We proposed a scale embedding approach that provides information on the position and scale of numbers, and it significantly increases MST performance.

Limitations
Our scale embedding can make mistakes when the unit itself contains numbers (e.g., mg/100ml). Therefore, scale embedding should not be applied to UoMs containing numbers, which requires exception handling.
Our work is strongly affected by the created prompts. If a prompt is not obvious for PLMs to understand, they may not give the correct answer even though they have the required reasoning ability. To mitigate this problem, we conducted experiments with different sets of prompts in Section 4 and showed that the results maintain their tendency across prompts. Despite these efforts, it is still unclear what the optimal choice of prompt is. We leave this problem as future work.

A.1 Data Statistics
Table 8 shows the statistics of MSTs we used for experiments.

A.2 Training and Evaluation
The BERT configuration of all models is the same as the base model (L=12, H=768, A=12, Total Parameters=110M) in (Devlin et al., 2019). The maximum sequence length is 512. We trained the model with batch size 256 for 30 epochs. We used the Adam optimizer for training. The learning rate started from 5e-5 and linearly decayed towards 1e-8. We early-stopped training when the validation accuracy did not increase for 2 epochs. The batch size for evaluation is 128, and other settings are the same as in training. We found the optimal hyperparameters using grid search, evaluating learning rates [1e-5, 2e-5, 5e-5, 1e-4] and batch sizes [16, 32, 64, 128].

B More Details of Prompt Sets
The results with both decimal and scientific notation are shown in Table 9.

B.1 LABEL
Inspired by Yuan et al. (2021), we included synonyms as answers to make the prompts diverse. We used the website https://www.wordhippo.com/ to search for synonyms. Among the search results, we chose two words that match the context. We report the list of synonyms in Table 10.

B.2 CONTEXT
If the context differs from what the PLMs saw during pre-training, then the PLMs may struggle to solve MSTs even if they have the measuring skills. To mitigate this, we prepared four additional prompts with the same meaning. The additional prompts are listed in Table 11.

B.3 UOM
In the general domain, some of the UoMs listed in Table 7 rarely appear in context. For example, international units per liter (IU/l) is frequently used in pharmacology, but not in other scientific articles. Therefore, one may wonder whether such rare biomedical units disrupt the understanding of general-domain PLMs (e.g., BERT and ALBERT). To answer this question, we replaced all UoMs in the dataset with the commonly used UoMs: g, l, m, and s.

C Additional Results on Rule-based Conversion
Table 12 describes the complete set of MST results after applying rule-based conversion.

D Additional Results on Scale Embedding
Table 13 describes the MST results of scale embedding with decimal and scientific notation.

E Experimental Environment
We trained the models with Google TPU v2-8 and v3-8. We used PyTorch 1.10.0 (Paszke et al., 2019) and Huggingface Transformers 4.3.3 (Wolf et al., 2020) for the experiments.

Table 13: Effect of scale embedding on MSTs. We report the classification accuracy and performance improvement (∆) after applying scale embedding.

Table 1 :
Examples of measuring skill tests (MSTs). We underline the correct answer for each example.

Table 3 :
Test-set results on MSTs. We report the classification accuracy on the interpolation (in) and extrapolation (ex) test datasets. COMP, ARG, SORT, UNIT, and REF are abbreviations of COMPARISON, ARGMIN/MAX, SORTING, UNIT CONVERSION, and REFERENCE RANGE DETECTION, respectively. Sci and Deci stand for scientific and decimal notation, respectively.

Table 4 :
Test-set results on different sets of prompts. We report the classification accuracy and the performance difference (∆). We obtain ∆ by subtracting the results in Table 3 from this table.

Table 5 :
Test-set results on rule-based conversion experiments. We report the classification accuracy and the performance difference (∆).

Table 7 :
List of units used for data generation.

Table 8 :
Statistics of MSTs used for experiments.

Table 9 :
Test-set results on different sets of prompts. We report the classification accuracy and the performance difference (∆). We obtain ∆ by subtracting the results in Table 3 from this table.

Table 12 :
Test-set results on rule-based conversion experiments. We report the classification accuracy and the performance difference.