Beyond Distributional Hypothesis: Let Language Models Learn Meaning-Text Correspondence

The logical negation property (LNP), which requires generating different predictions for semantically opposite inputs, is an important property that a trustworthy language model must satisfy. However, much recent evidence shows that large pre-trained language models (PLMs) do not satisfy this property. In this paper, we perform experiments using probing tasks to assess PLMs' understanding of the LNP. Unlike previous studies that only examined negation expressions, we expand the boundary of the investigation to lexical semantics. Through experiments, we observe that PLMs frequently violate the LNP. To alleviate the issue, we propose a novel intermediate training task, named meaning-matching, designed to directly learn a meaning-text correspondence instead of relying on the distributional hypothesis. Through multiple experiments, we find that the task enables PLMs to learn lexical semantic information. Also, through fine-tuning experiments on 7 GLUE tasks, we confirm that it is a safe intermediate task that guarantees similar or better performance on downstream tasks. Finally, we observe that our proposed approach outperforms previous counterparts despite its time and resource efficiency.

However, their reliability has recently been challenged. Many studies have conducted various probing tasks and observed that PLMs exhibit faulty behaviours, such as insensitivity to sentence ordering (Pham et al., 2021; Gupta et al., 2021; Sinha et al., 2021b), poor comprehension of number-related representations (Wallace et al., 2019; Lin et al., 2020; Nogueira et al., 2021), and a lack of semantic content understanding (Ravichander et al., 2020; Elazar et al., 2021). These issues raise concerns about PLMs' stability and reliability, precluding their application in practice, especially in risk-sensitive areas.
Another critical problem of PLMs is their inaccurate behaviour on negation, which is a principal property in many language understanding tasks. For tasks where the LNP holds (p is true iff ¬p is false; see Aina et al. 2018), PLMs should give different answers for the original and negated inputs. However, several studies have observed that PLMs violate this property. In masked knowledge retrieval tasks, PLMs frequently generate incorrect answers for negated input queries (Ettinger, 2020; Kassner and Schütze, 2020). In other studies, PLMs show poor generalisation ability on negated natural language inference (NLI) datasets (Naik et al., 2018; Hossain et al., 2020).
Although the aforementioned studies produced promising analysis results, they limited the scope of the LNP to adding negation expressions (e.g., "no" and "not"). However, other perturbations that generate the opposite meaning can also be applied to this property. Therefore, considering such perturbation methods is necessary to fully assess whether PLMs satisfy the LNP.
Also, remedies to alleviate the problem have not yet been studied much. Hosseini et al. (2021) recently employed data augmentation and unlikelihood training (Welleck et al., 2020) to prevent models from generating unwanted words, given the augmented negated data during masked language modelling (MLM). However, this approach has several downsides. First, like previous works, Hosseini et al. (2021) only considered negation expressions. Second, the data augmentation method is contingent on many additional linguistic components, which makes a model's performance dependent on certain modules and precludes applying the method to other languages where such resources are unavailable. Finally, the model must be pre-trained from scratch with the unlikelihood objective, which consumes considerable time and resources.
In this paper, we expand the boundary of the LNP to lexical semantics, i.e., synonyms and antonyms, and ascertain that PLMs are prone to violate the LNP. Next, we propose a remedy, called intermediate training on meaning-matching (IM²), which hardly employs additional linguistic components. We hypothesise that a leading cause lies in the MLM training objective, which relies on the distributional hypothesis for learning the meaning of text (Sinha et al., 2021a). Instead, we design a model that directly learns the correspondence between words and their semantic contents. Through experiments, we verify that our approach improves the model's comprehension of the LNP, while showing stable performance on multiple downstream tasks.
Our main contributions are as follows: (i) we extend the investigation of the LNP from negation to lexical semantics (Section 2), (ii) we reveal that PLMs are prone to violate the LNP (Section 3), (iii) we propose a novel remedy, named IM², which is decoupled from the distributional hypothesis and learns meaning-text correspondence instead (Section 4), (iv) through experiments, we ascertain that the proposed approach improves the understanding of negation and lexical semantic information (Sections 5.1 and 5.2), and (v) we verify that meaning-matching is a stable and safe intermediate task that produces similar or better performance on multiple downstream tasks (Sections 5.3 and 5.4).

Probing Tasks for Investigating the Logical Negation Property
We design three probing tasks to evaluate whether PLMs satisfy the LNP: masked knowledge retrieval on negated queries (MKR-NQ), masked word retrieval (MWR), and synonym/antonym recognition (SAR). Brief illustrations of each task are in Figure 1.

Masked Knowledge Retrieval on Negated Queries
The MKR-NQ task examines whether PLMs generate incorrect answers for negated queries. Following the work of Kassner and Schütze (2020), we constructed the evaluation dataset by negating the LAMA dataset (Petroni et al., 2019), which contains masked free-text forms of ConceptNet (Speer et al., 2017) triplets and their corresponding answers (e.g., (bird, CapableOf, fly) → ("A bird can [MASK]", fly)). The task aims to generate a correct word through MLM. According to the LNP, a model must not generate the original answer if the query is negated. To measure how likely PLMs are to generate wrong predictions for negated queries, we collected pairs of (negated_query, wrong_predictions). We selected several relations in the LAMA dataset that ensure mutual exclusiveness between the original and negated queries. For negating sentences, we selected LAMA data points that contain a single verb using the spaCy parts-of-speech (POS) tagger (Honnibal and Johnson, 2015). Next, we added negation expressions, such as "not" and "don't", or removed such expressions if they existed. Finally, we collected the wrong predictions from ConceptNet by using the head entity and relation. As a result, we collected 3,360 data points for this task. The list of the relations we used and examples of the data are in Table 10 in Appendix A.
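The add-or-remove negation step can be sketched as follows. This is only a toy illustration: it uses a hard-coded list of auxiliary/modal verbs instead of the spaCy POS tagger used in the actual construction, and does not cover the "don't" insertion case.

```python
# Simplified sketch of the MKR-NQ negation step. The real pipeline finds
# single-verb queries with spaCy; here a hard-coded modal list stands in.
MODALS = {"can", "is", "are", "was", "were", "will", "may"}

def negate_query(query: str) -> str:
    """Insert 'not' after the first modal/auxiliary verb, or remove an
    existing 'not' to recover the positive form of a negated query."""
    tokens = query.split()
    if "not" in tokens:                 # already negated: strip the negation
        tokens.remove("not")
        return " ".join(tokens)
    for i, tok in enumerate(tokens):
        if tok.lower() in MODALS:       # "A bird can [MASK]" -> "... can not ..."
            return " ".join(tokens[: i + 1] + ["not"] + tokens[i + 1 :])
    return query                        # no handled verb form: leave unchanged

print(negate_query("A bird can [MASK] ."))       # -> "A bird can not [MASK] ."
print(negate_query("A bird can not [MASK] ."))   # -> "A bird can [MASK] ."
```

Applying the function twice recovers the original query, mirroring the paper's symmetric "add or remove negation" construction.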

Masked Word Retrieval
To expand the boundary of the LNP to lexical semantics, we design the MWR task, which answers a masked query asking for the synonym/antonym of a target word through MLM (e.g., "happy is the synonym of [MASK]").
Let $s_w$ and $a_w$ denote masked queries that ask for the synonym and antonym of the word $w$, respectively. Also, let $A_s$ and $A_a$ refer to the lists of correct answers for $s_w$ and $a_w$, respectively. Intuitively, $A_a$ becomes the wrong prediction set of $s_w$, because $s_w$ and $a_w$ have opposite meanings. Therefore, we can evaluate violations of the LNP by investigating whether a PLM generates these wrong predictions.
To extract commonly used words for our experiment, we first extracted nouns, adjectives, and adverbs that appear more than five times in the SNLI dataset (Bowman et al., 2015). Among the extracted candidates, we kept words that have synonyms or antonyms in ConceptNet. Finally, we generated masked queries by employing the templates used by Camburu et al. (2020). The templates and examples of the data are in Table 11 in Appendix A.
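The query construction and the LNP-based labelling of wrong predictions can be sketched as follows. The template strings here are hypothetical stand-ins; the actual templates of Camburu et al. (2020) are listed in Table 11.

```python
# Illustrative MWR query construction. Templates are hypothetical
# stand-ins for those of Camburu et al. (2020).
TEMPLATES = {
    "synonym": "{word} is the synonym of [MASK].",
    "antonym": "{word} is the antonym of [MASK].",
}

def build_mwr_queries(word, synonyms, antonyms):
    """Return (query, correct_answers, wrong_predictions) triples.
    By the LNP, the antonyms of w are wrong predictions for the
    synonym query s_w, and vice versa."""
    return [
        (TEMPLATES["synonym"].format(word=word), synonyms, antonyms),
        (TEMPLATES["antonym"].format(word=word), antonyms, synonyms),
    ]

for query, correct, wrong in build_mwr_queries("happy", ["glad"], ["sad"]):
    print(query, correct, wrong)
```

Each word thus contributes one synonym-asking and one antonym-asking query, with the opposite answer list serving as the wrong prediction set.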

Synonym/Antonym Recognition
SAR is a classification task that determines whether two given words are synonyms or antonyms. It aims to evaluate whether the contextualised representations of PLMs reflect the lexical meaning of words. Therefore, we use a parametric probing model (Adi et al., 2017; Liu et al., 2019a; Belinkov and Glass, 2019; Sinha et al., 2021a) for the experiment. Specifically, the experiment is performed on the final layer of each PLM, i.e., we only train the classifier while keeping the encoder frozen. We use ConceptNet to build the dataset. ConceptNet contains many more synonym triplets than antonym triplets, so we randomly subsample the synonym triplets to maintain balance. In total, we collect 33K, 1K, and 2K data points for the train, dev, and test sets, respectively.
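The frozen-encoder probing protocol can be illustrated with a toy stand-in for the PLM: a fixed random projection plays the role of the frozen encoder, and only a logistic-regression classifier on top of its representations is trained. This NumPy sketch shows the protocol, not the actual SAR experiment (which trains a classifier over the final layer of each PLM).

```python
import numpy as np

# Toy illustration of the frozen-encoder probing protocol.
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(8, 16))      # stand-in "pre-trained" encoder (frozen)

def encode(x):
    """Frozen encoder: W_enc is never updated."""
    return np.tanh(x @ W_enc)

def train_probe(X, y, lr=0.5, epochs=200):
    """Train only the linear probe (w, b) on top of frozen representations."""
    H = encode(X)                     # representations computed once, fixed
    w, b = np.zeros(H.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(H @ w + b)))   # sigmoid
        g = p - y                                # gradient of logistic loss
        w -= lr * H.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

# Synthetic binary task: the label depends on the first input feature.
X = rng.normal(size=(64, 8))
y = (X[:, 0] > 0).astype(float)
w, b = train_probe(X, y)
acc = (((1.0 / (1.0 + np.exp(-(encode(X) @ w + b)))) > 0.5) == y).mean()
print(f"probe accuracy: {acc:.2f}")
```

The probe's accuracy reflects how much task-relevant information the frozen representations already carry, which is exactly the quantity the SAR encoder-fixed setting measures.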

Evaluation Metrics
We use the top-k hit rate (HR@k) to evaluate the performance on the MKR-NQ and MWR tasks. Assume that $P = \{(p_1, c_1), (p_2, c_2), \ldots, (p_n, c_n)\}$ denotes the set of predictions for a data point $x$, where $p_t$ and $c_t$ refer to the predicted word and confidence score of the $t$-th prediction, respectively. Then, the top-k hit rate for a data point $x$ is defined as $\mathrm{HR@}k(x) = \frac{1}{k} \sum_{t=1}^{k} \mathbb{1}[p_t \in W_x]$, where $W_x$ is the wrong prediction set of $x$. Intuitively, the metric measures the ratio of top-k predicted words that belong to the wrong prediction set.
To reflect the prediction confidence scores in the evaluation, we additionally define the weighted top-k hit rate (WHR@k), which uses the confidence scores as weights: $\mathrm{WHR@}k(x) = \sum_{t=1}^{k} c_t \, \mathbb{1}[p_t \in W_x] \,/\, \sum_{t=1}^{k} c_t$. It is worth mentioning that lower values indicate better model performance for both metrics, as they assess how likely the models are to produce inaccurate answers that they must avoid. For the SAR task, we employ accuracy as the evaluation metric, because each data point has its own label and the label distribution is not skewed.
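Under these definitions, both metrics are straightforward to compute. The confidence-normalised weighting below is one plausible reading of the description (the exact formula is not shown in this excerpt):

```python
def hr_at_k(predictions, wrong_set, k):
    """Top-k hit rate: fraction of the top-k predicted words that fall
    in the wrong-prediction set W_x (lower is better)."""
    top_k = [p for p, _ in predictions[:k]]
    return sum(p in wrong_set for p in top_k) / k

def whr_at_k(predictions, wrong_set, k):
    """Weighted top-k hit rate: same ratio, with each prediction weighted
    by its confidence score (lower is better)."""
    top_k = predictions[:k]
    total = sum(c for _, c in top_k)
    return sum(c for p, c in top_k if p in wrong_set) / total

# Toy predictions for "happy is the synonym of [MASK]": (word, confidence).
preds = [("sad", 0.6), ("glad", 0.3), ("blue", 0.1)]
wrong = {"sad", "blue"}                # antonyms = wrong predictions
print(hr_at_k(preds, wrong, 3))        # -> 0.666...
print(whr_at_k(preds, wrong, 3))       # -> 0.7
```

Note how WHR@3 (0.7) exceeds HR@3 (0.67) here because the wrong word "sad" carries the highest confidence, which is exactly the effect the weighted metric is designed to expose.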

PLMs Lack Information of Negation and Lexical Semantics
We select the following PLMs for the experiments: bidirectional encoder representations from transformers (BERT)-base/large (Devlin et al., 2019), RoBERTa-base/large (Liu et al., 2019b), and ALBERT-base/large (Lan et al., 2019). These PLMs are pre-trained with the MLM training objective. We add the ELECTRA-small/base/large models (Clark et al., 2020) for the SAR task, but they are not used in the MKR-NQ and MWR experiments, as the discriminator of the ELECTRA models is trained with the replaced token prediction (RTP) objective and has no MLM classifier. No additional training is required for the MKR-NQ and MWR tasks. For the SAR task, we fine-tune each PLM for 10 epochs and apply the early stopping technique. We use the AdamW optimiser (Loshchilov and Hutter, 2019) with a learning rate of 5e-6 and a batch size of 32.

Results for MKR-NQ
The results for the MKR-NQ task are summarised in Table 1. In general, the results are consistent with previous works (Ettinger, 2020; Kassner and Schütze, 2020). We observe three important characteristics in the experimental results. First, large models produce a higher hit rate than their corresponding base-size models for all three PLM families, recording values that are about 1.5 times higher on average. This implies that large-size models are more likely to generate wrong predictions for negated queries, even though they perform better than small-size models on many benchmark tests. The results suggest that evaluating a model's performance solely on the accuracy metric is unwise.
Second, the hit rate decreases as k increases, which implies that the majority of PLMs' top predictions (e.g., k=1 or k=2) are incorrect. Finally, the weighted hit rate is much higher than the vanilla hit rate, suggesting that PLMs generate wrong predictions with high confidence.

Results for MWR
The results of the MWR task are summarised in Table 1. The three characteristics found in the MKR-NQ task are also observed in the MWR task. Also, we found the following additional patterns.

PLMs lack knowledge of antonyms.
In general, the hit rates are extremely high compared to the MKR-NQ task for all the PLMs. Analysing their predictions, we find that PLMs generate incorrect predictions primarily for antonym-asking queries. Specifically, the average HR@1 of the antonym-asking queries is 41.9%, while that of the synonym-asking queries is only 1.4%. A leading cause is that PLMs simply replicate the word presented in the input query. Table 2 shows the ratio of instances where each PLM reproduces the same word in a question. While the values are quite high for both synonym-asking and antonym-asking queries, the problem is more severe in the latter case, because the generated predictions are definitely incorrect. Based on our results, we conclude that PLMs' contextualised representations lack lexical semantic information. Our conclusion is in line with the findings of Liu et al. (2019a), who showed that encoder-fixed PLMs are not suitable for tasks that require fine-grained linguistic knowledge.
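The replication statistic reported in Table 2 amounts to the following simple check (the example data here are illustrative):

```python
def replication_ratio(examples):
    """Fraction of queries where the model's top prediction merely copies
    the target word from the query (cf. Table 2)."""
    hits = sum(pred == word for word, pred in examples)
    return hits / len(examples)

# (query word, model's top prediction) pairs -- illustrative only.
examples = [("happy", "happy"), ("happy", "glad"),
            ("big", "big"), ("fast", "quick")]
print(replication_ratio(examples))   # -> 0.5
```

For antonym-asking queries, every such copy is by definition a wrong prediction, which is why the replication behaviour inflates the antonym hit rates in particular.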
Issues are more severe with nouns.
We observe that the hit rates are higher when a word in a question is a noun. Specifically, the average HR@1 values of nouns, adjectives, and adverbs are 35.1%, 27.4%, and 11.8%, respectively. Interestingly, PLMs have a high error rate when dealing with nouns even though they are trained with a large written English corpus, where nouns form the greatest portion (at least 37%) of all POS tags (Hudson, 1994;Liang and Liu, 2013).

Results for SAR
As part of the comparison, we fine-tune each PLM on the SAR task, i.e., we train the entire set of parameters. The results are summarised in Table 3. We observe a huge gap between the performance of fine-tuned models and that of encoder-fixed models. In contrast to the fine-tuned models, which produce a high accuracy, encoder-fixed models fall short of expectations, even recording almost random-guess performance for the BERT models. Also, in line with common belief, large models' performance improves greatly when fine-tuned. However, the difference between the large and small encoder-fixed models is insignificant, except for the ELECTRA models, which exhibit only a marginal improvement. These two phenomena suggest that PLMs' outstanding performance is predicated on updating many parameters to learn syntactic associations present in the training data (Niven and Kao, 2019; McCoy et al., 2019), while their contextualised representations do not carry abundant lexical meaning information.

Intermediate Training on the Meaning-Matching Task: IM²

Issue of PLMs
Through the previous experiments, we observe that PLMs contain little information about negation and, especially, lexical semantics. We hypothesise that a leading cause lies in the training objective of PLMs: the language modelling (LM) objective, which is the backbone pre-training task of almost all PLMs.
In the LM objective, words are generated based on given contexts. The distributional hypothesis (Harris, 1954), which assumes that semantically related or similar words appear in similar contexts (Mrkšić et al., 2016), is the underpinning assumption of the LM objective (Sinha et al., 2021a). Under this assumption, a model learns the meaning of texts based on their correlations with others. This is a great benefit, because a model can learn the meaning of texts using only their surface form, allowing unsupervised training. Based on this advantage, many unsupervised representations, such as Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and current PLMs, have been developed.
However, the problem is that the distributional hypothesis has limitations in reflecting a word's semantic meaning, because words with different or even opposite meanings can appear in similar or identical contexts. For instance, consider the two words "boy" and "girl". We can readily imagine sentences in which the two words appear in the same context, e.g., "the little boy/girl cuddled the teddy bear closely". As a result, a model can learn their common functional meaning, i.e., young human being, and the vector representations would be very similar if they were trained based on the distributional hypothesis. However, the representations hardly capture their semantic antonymy, e.g., gender. Similarly, negated sentences have almost identical contexts to their original forms. As a result, models cannot effectively learn the semantic meaning of words and negation expressions as long as they leverage only the text forms.

Meaning-Matching Task
In light of the meaning-text theory, there is a correspondence between linguistic expressions (text) and semantic contents (meaning) (Mel'čuk and Žolkovskij, 1970; Milićević, 2006). Instead of relying solely on the distributional hypothesis, we propose the new meaning-matching task, which directly learns this correspondence. Specifically, meaning-matching is a classification task that takes a word and a sentence as input and determines whether the sentence defines the word correctly. Through this task, a model can learn both meaning-text correspondences and correlations between a word and the other words in its definition, which are rarely found in general corpora.
For training PLMs on our new task, we apply the intermediate-training technique (Phang et al., 2018; Wang et al., 2019a; Liu et al., 2019a; Pruksachatkun et al., 2020; Vu et al., 2020), which first fine-tunes PLMs on an intermediate task and then fine-tunes the model again on target tasks. It has been shown that training on intermediate tasks that require high-level linguistic knowledge and inference ability can improve performance (Liu et al., 2019a; Pruksachatkun et al., 2020). Furthermore, it is more efficient in time and resources than pre-training models on large corpora (e.g., the BERTNOT model (Hosseini et al., 2021)).
Dataset. We collect about 150K free-text definitions that depict the meaning of English words from WordNet (Miller, 1995).

Table 4: Results of BERT-large and RoBERTa-large after applying the IM² approach. Each value is multiplied by 100 for better readability. Note that lower values are better.

Training details.
It is necessary to generate false word-definition pairs to train PLMs on the meaning-matching task. To achieve this, we use a negative sampling technique, pairing each word with k randomly sampled incorrect definitions. We investigate the proper k among 3, 5, 10, and 20. For the hyperparameter search, the performance of the RoBERTa-base model on the SAR task is used as the criterion. Figure 2 illustrates the SAR performance of the RoBERTa-base model with different k values. Intuitively, a large k value should lead the model to a better performance by letting it investigate more word-meaning combinations. However, we observe that the model performs best when k is 10, and the performance decreases if k is too large. We conjecture that a leading cause is that the dataset contains many words with similar meanings, mostly derived from the same stem. As a result, large k values increase the possibility of recognising the meanings of such similar words as different.
To avoid the class-imbalance issue within a batch, we duplicate the correct word-definition pairs k times when constructing the training data. For training, the AdamW optimiser is used with a learning rate of 5e-6. We use 5% of the data points for validation and train the models for 15 epochs with a batch size of 32. The early stopping technique is used to prevent overfitting.
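The pair-construction procedure above can be sketched as follows. The field names and the toy lexicon are illustrative; only the k-negatives-plus-k-duplicated-positives scheme follows the text.

```python
import random

# Sketch of meaning-matching training-pair construction: each word gets its
# true definition (label 1) and k negatively-sampled definitions (label 0);
# the positive pair is duplicated k times so classes stay balanced.
def build_pairs(lexicon, k=10, seed=0):
    rng = random.Random(seed)
    words = list(lexicon)
    pairs = []
    for word, definition in lexicon.items():
        pairs.extend([(word, definition, 1)] * k)   # duplicated positives
        for _ in range(k):                          # k negative samples
            other = rng.choice(words)
            while other == word:                    # never sample the true word
                other = rng.choice(words)
            pairs.append((word, lexicon[other], 0))
    return pairs

lexicon = {"happy": "feeling pleasure", "sad": "feeling sorrow",
           "fast": "moving quickly"}
pairs = build_pairs(lexicon, k=2)
print(len(pairs))   # 3 words * (2 positives + 2 negatives) = 12
```

Duplicating positives rather than downsampling negatives keeps all k sampled word-definition mismatches in the data while preserving a 1:1 label ratio.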

Experiments and Results
We conduct the same probing tasks after the intermediate training on the meaning-matching task.

SAR Results
We first focus on the SAR task. After the intermediate training, all models are fine-tuned on the SAR task with the same hyperparameters described in Section 3. The results are summarised in Table 5.

Improved lexical semantic information.
We generally observe marginal or no significant improvements when fine-tuning all parameters, especially for large-size PLMs. However, with a fixed encoder, the performance is significantly improved for PLMs with more than 100M parameters, and the improvements are more pronounced for large PLMs. Our results show that the proposed approach helps PLMs learn enhanced representations with more abundant lexical semantic information.

Catastrophic forgetting.
We find that small PLMs, such as the ELECTRA-small and ALBERT models, show no significant increase in performance or are even negatively impacted. Because all PLMs achieve comparable performance on the meaning-matching task, we hypothesise that a leading cause is catastrophic forgetting (Pruksachatkun et al., 2020; Wallat et al., 2020), where the model forgets knowledge learned during pre-training in order to accommodate new information from the intermediate task. To verify this, we measure the change of parameter values after IM². Concretely, let $M_i$ and $M_i^{mm}$ denote the parameters of the $i$-th layer before and after IM², respectively. We calculate the Frobenius norm of the parameter change for each layer: $F_i = \| M_i^{mm} - M_i \|_F$. Figure 3 shows the boxplots of $F_i$ for each PLM. We observe that the parameters of the ELECTRA-small model, which is negatively impacted, change considerably compared to the other PLMs with more than 100M parameters. The results suggest that model size is an important factor in preventing catastrophic forgetting.
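The per-layer drift measure can be computed as below. This is a sketch; the paper may normalise or average the norm differently across weight matrices within a layer.

```python
import numpy as np

# Parameter drift after IM^2: Frobenius norm of the per-layer weight
# difference, F_i = ||M_i^mm - M_i||_F.
def layer_drift(before, after):
    """Return F_i for each layer i, given lists of weight matrices."""
    return [np.linalg.norm(a - b) for b, a in zip(before, after)]

before = [np.zeros((4, 4)), np.ones((4, 4))]
after  = [np.eye(4), np.ones((4, 4))]   # layer 0 changed, layer 1 untouched
print(layer_drift(before, after))       # -> [2.0, 0.0]
```

Large drift values in small models are consistent with the catastrophic-forgetting interpretation: the intermediate task overwrites more of the pre-trained parameters.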

MKR-NQ and MWR Results
Next, we perform the MKR-NQ and MWR tasks after applying the IM² method. Since our models are no longer trained with the MLM objective, we replace the encoder of the original PLMs with that of the models fine-tuned on the meaning-matching task and reuse the original MLM classifier. For the experiments, we use BERT-large and RoBERTa-large, because they are pre-trained with the MLM objective and their parameters are hardly changed after applying the IM² method. The results are summarised in Table 4.
We observe substantial decreases in the hit rates of incorrect predictions for both PLMs. For the MWR task, we find that the issue of regenerating a word from the given query is greatly relieved after applying the IM² method. Specifically, the percentage of such instances drops from 40.3% to 19.6% and from 33.8% to 25.2% for BERT-large and RoBERTa-large, respectively. Several examples of the predicted results are presented in Table 6. The results support our claim that the IM² approach benefits the learning of lexical semantic information and the meaning of negated expressions.

Fine-Tuning on the GLUE Benchmark
A critical drawback of intermediate training is that the target-task performance can be negatively impacted if the intermediate task is not related to the target task (Liu et al., 2019a; Pruksachatkun et al., 2020). To confirm whether this issue occurs, we compare the performance of BERT, RoBERTa, and ELECTRA-large on 7 GLUE benchmark datasets (Wang et al., 2018) with their IM² counterparts. We train the models for 10 epochs on each dataset and apply the early stopping technique with a patience of 3. We observe that training generally finishes within 8 epochs for all the models. The batch size per GPU and the learning rates used for each dataset are described in Table 8. Tasks with large datasets (e.g., MNLI, QNLI, and QQP) were not sensitive to the hyperparameters.
The results are presented in Table 7. We find no significant difference in performance for tasks with large datasets, such as MNLI, QNLI, QQP, and SST-2. In contrast, tasks with small datasets, like MRPC and RTE, are slightly improved. The result is consistent with Pruksachatkun et al. (2020) and Vu et al. (2020), who showed that smaller tasks benefit much more from intermediate training. Furthermore, unlike previous studies that observed a negative transfer on the CoLA dataset (Phang et al., 2018; Pruksachatkun et al., 2020), the performance is improved with our approach. The result suggests that meaning-matching is a safe intermediate task that ensures a positive transfer to target downstream tasks.

Experiments on the NegNLI Dataset
Finally, we conduct experiments on the NegNLI benchmark dataset (Hossain et al., 2020), where negation plays an important role in NLI tasks. As a baseline, we compare against the reported performance of BERTNOT (Hosseini et al., 2021), a recently proposed remedy to improve PLMs' ability to understand negation. Since Hosseini et al. (2021) used BERT-base as the backbone model, we also apply the IM² method to BERT-base. The results are summarised in Table 9.
For both SNLI and MNLI, we observe that our approach outperforms BERTNOT on the NegNLI datasets, while yielding comparable performance on the original development sets. It is interesting that our approach improves the understanding of negation in both the MKR-NQ and NegNLI tasks. We conjecture that a leading cause is that the definitions in the meaning-matching dataset contain many negation expressions, which enables a model to learn their meaning (see Table 12). The results also suggest that our proposed approach is more efficient than BERTNOT, because the IM² method requires less time and fewer resources for training.

Related Work
PLMs are at the core of many success stories in natural language processing (NLP). However, it remains unclear to what extent PLMs understand the syntactic and semantic properties of human language. A series of probing tasks have been conducted on PLMs and have found them lacking or falling short on some language properties. Among the many findings of these probing tasks, PLMs have been found to be insensitive to the order of sentences when generating representations (Pham et al., 2021; Gupta et al., 2021; Sinha et al., 2021a), to struggle with number-related representations (Wallace et al., 2019; Lin et al., 2020; Nogueira et al., 2021), and to display a lack of semantic content understanding (Ravichander et al., 2020; Elazar et al., 2021).
In addition to the above faulty behaviours, Ettinger (2020) and Kassner and Schütze (2020) show that PLMs fail to comprehend negation, which is an important property of language in many natural language understanding (NLU) tasks. Ettinger (2020) checks the ability of PLMs to understand the meaning of negation in given contexts by testing whether models' completions are sensitive to the presence or absence of negation in a sentence. Under normal circumstances, the completions are expected to vary in truth depending on whether the sentence is negated. The results show that PLMs are insensitive to the impact of negation when completing sentences. Kassner and Schütze (2020) construct the negated LAMA dataset by inserting negation elements (e.g., "not") into the LAMA cloze questions (Petroni et al., 2019). They use negated and original question pairs to query PLMs and establish that models are prone to make the same predictions for both the original and negated questions. In a well-informed setting, PLMs would be expected to make different predictions for the original and negated questions. This shows that PLMs struggle to comprehend negation.
In light of the highlighted faulty behaviours of PLMs, especially their struggle to comprehend negation, Hosseini et al. (2021) propose a remedy to alleviate the problem. In their remedy, they augment the language modelling objective with an unlikelihood objective (Welleck et al., 2020) based on negated sentences from the training corpus. They use a syntactic augmentation method to generate negated sentences. In this method, the dependency parse of the sentences, POS tags, and morphological information of each word are taken as input, and the negation of sentences is done using sets of dependency tree regular expression patterns, such as Semgrex (Chambers et al., 2007). During training, they replace objects in negated sentences with [MASK] tokens and use unlikelihood training to make the masked-out tokens unlikely under the PLM distribution. To ensure that negated sentences are factually false, they use the corresponding positive sentences as context for the unlikelihood prediction task.
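The interplay of the likelihood and unlikelihood objectives can be illustrated on a single token probability. This is a schematic sketch of the loss terms, not BERTNOT's full training loop:

```python
import math

# Schematic of the unlikelihood objective (Welleck et al., 2020) as used by
# BERTNOT: a masked token that should NOT be generated under a negated
# context is penalised proportionally to its probability.
def likelihood_loss(p_token):
    """Standard MLM loss, -log p, for a token the model should produce."""
    return -math.log(p_token)

def unlikelihood_loss(p_token):
    """-log(1 - p) for a token the model must avoid."""
    return -math.log(1.0 - p_token)

print(unlikelihood_loss(0.9))   # high probability on a forbidden token -> large loss
print(unlikelihood_loss(0.1))   # low probability -> small loss
```

Minimising the unlikelihood term drives the probability of the forbidden token (e.g., the original answer under a negated query) toward zero, complementing the usual likelihood term on valid tokens.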
Previous studies (e.g., Kassner and Schütze (2020)) have mostly limited the scope of the logical negation property to negation expressions (e.g., "no" and "not"). However, the core of this property is opposite meaning, which is not limited to negation. Welleck et al. (2020) consider negating sentences using dependency-tree regular-expression patterns, which widens the scope of negation beyond the expressions "no" and "not". However, their approach relies on other components, such as Semgrex and dependency and POS parsers, which could affect the quality of the data and hence the models' performance. In this work, we consider other perturbation methods that generate opposite-meaning sentences to investigate whether PLMs satisfy the logical negation property, and we propose a remedy, called intermediate training on meaning-matching (IM²), which hardly employs additional linguistic components.

Summary and Outlook
In this work, we investigated whether PLMs satisfy the LNP. Compared to previous works that only examine negation expressions, we expanded the boundary of the LNP to lexical semantics. Through extensive experiments, we confirmed that PLMs are prone to violate the LNP.
We hypothesise that the distributional hypothesis is an insufficient basis for understanding the semantic meaning of texts. To alleviate the issue, we proposed a novel intermediate task: meaning-matching. Through experiments, we verified that meaning-matching is a stable intermediate task that substantially improves PLMs' understanding of negation and lexical semantic information while guaranteeing a positive transfer to multiple downstream tasks. Also, our approach produces better performance on the negated NLI datasets than the unlikelihood training-based method, which requires far more time and resources. Our work suggests that it is time to move beyond the distributional hypothesis to develop logically consistent and stable language models.