Where’s the Learning in Representation Learning for Compositional Semantics and the Case of Thematic Fit

Observing that for certain NLP tasks, such as semantic role prediction or thematic fit estimation, random embeddings perform as well as pre-trained embeddings, we explore which settings allow for this and examine where most of the learning is encoded: in the word embeddings, the semantic role embeddings, or “the network”. We find nuanced answers, depending on the task and its relation to the training objective. We examine these representation learning aspects in multi-task learning, where role prediction and role-filling are supervised tasks, while several thematic fit tasks lie outside the models’ direct supervision. We observe a non-monotonic relation between some tasks’ quality scores and the training data size. To better understand this observation, we analyze these results using easier, per-verb versions of these tasks.


Introduction
We examine to what extent models trained on a simplified semantic role labeling (SRL) task can estimate thematic fit (aka semantic fit) as the training set size grows, and where most of the learning is stored: in the word embeddings, the thematic role embeddings, or elsewhere in the neural net.
A major goal of natural language processing (NLP) is to understand the semantics of language. One traditional NLP task to this end is SRL, which labels word spans in a sentence with thematic roles. Consider the sentence "I cut the cake with a knife". We can interpret 'cut' as the action, 'I' as the Agent (the performer of the action), 'cake' as the Theme (the thing that underwent the action), and 'knife' as the Instrument of the action. These words, labeled with roles such as Agent, Theme, and Instrument, would be our representation of the event that the sentence conveys. Other sentences with similar meanings, e.g., "the cake was cut with the knife by me", should have the same (or very similar) event representations. In this work, we focus on model training with a simplified version of SRL: each event is represented only by the lemmatized syntactic head of each event argument (including the predicate), and the semantic roles are the simplified PropBank roles (Arg0, Arg1, etc.). The reason for this is the current limitations of available evaluation sets for thematic fit: they all likewise comprise lemmatized syntactic argument heads.

* These authors contributed equally to this work.
Thematic fit is related to SRL, but distinct. This task aims to identify how well a given word or concept fits into a role of an event. Going back to our example sentence, consider these potential replacements for 'knife': scissors, fork, and brick. As humans, we understand that while 'knife' is the most typical object for this situation, both 'scissors' and 'fork' could also fit, even if not as naturally. This is because we have a construct of all three objects being instruments for cutting. Moreover, we know that 'brick' is unlikely to fit given the context of cutting a cake. Since thematic fit datasets are scarce, one challenge in computational linguistics (and computational psycholinguistics) revolves around how machine learning models can learn thematic fit indirectly, perhaps from SRL training. To the best of our knowledge, the state of the art in this line of work is the residual role-filler averaging model (ResRoFA-MT) proposed by Hong et al. (2018), with an adjusted embeddings representation and training data annotation in Marton and Sayeed (2022).
In this paper, we examine training set size effects on thematic fit tasks, for which the models were not directly optimized, even after reaching a plateau on the simplified SRL task and its complementary task (predicting the head word given the role).
1. We find surprising training set size interactions with specific evaluation sets, and design a modified evaluation metric to better understand these interactions.
2. We also modify the ResRoFA-MT model architecture in various ways to understand what contributes most to the learning: the pretrained (or random) word embeddings, the thematic role embeddings, or the rest of the network.
3. To be able to train on larger data, we optimized the code of Hong et al. (2018) and Marton and Sayeed (2022). We release our optimized codebase*, which trains 6 times faster and includes ablation architectures and a correction to the training data preparation step.

Related Work
In event representation models, the main goal is to predict the appropriate word in a sentence given both the role of that word and the surrounding context in the form of word-role pairs. One of the best early neural models was the non-incremental role-filler model (NNRF), by Tilk et al. (2016). This model was based on selectional preferences, or a probability distribution over the candidate words. However, one drawback of this model is that representations of two similarly-worded sentences differing hugely in meaning would closely resemble each other, e.g., "kid watches TV" and "TV watches kid". Another drawback is that the embeddings of the word-role pairs are summed together to represent the sentence, and so the resulting event representation vector does not weight the input vectors differently based on their importance and is not normalized for varying numbers of roles in a sample. Hong et al. (2018) extend this model in three ways. First, in addition to the word prediction task of NNRF, the task of role prediction given the corresponding word is added, and the two tasks are trained simultaneously (multi-task learning). This model is known as the non-incremental role-filler multi-task model (NNRF-MT). Second, they apply the parametric rectified linear unit (PReLU) non-linear function to each word-role embedding, which acts as weights on the composition of embeddings, and subsequently average the embeddings, which normalizes for variable-length inputs. This model is called the role-filler averaging model (RoFA-MT). Third, in an effort to tackle the vanishing gradient problem, residual connections between the PReLU output and the averaging input were added. This third iteration is known as the ResRoFA-MT model. They showed that it performs the best on our thematic fit tasks, and so we use it as our baseline.
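The composition steps above (elementwise word-role interaction, PReLU weighting, a residual connection, and averaging) can be sketched in a few lines of NumPy. This is a toy illustration under our own assumptions (hypothetical function names, a fixed PReLU slope of 0.1, tiny dimensions), not the authors' exact implementation:

```python
import numpy as np

def prelu(x, alpha=0.1):
    # Parametric ReLU: identity for positives, alpha-scaled negatives.
    # (In the real model, alpha is a learned parameter.)
    return np.where(x > 0, x, alpha * x)

def compose_event(word_vecs, role_vecs, alpha=0.1):
    # Sketch of a ResRoFA-MT-style event composition over n role-filler
    # pairs, each row a (dim,)-vector; returns one fixed-size event vector.
    pair = word_vecs * role_vecs        # elementwise word-role interaction
    weighted = prelu(pair, alpha)       # non-linear weighting of each pair
    residual = weighted + pair          # residual connection around the PReLU
    return residual.mean(axis=0)        # averaging normalizes for event length
```

Averaging (rather than summing) keeps the event vector's scale independent of how many roles are filled, which is the normalization issue raised above.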
Our work differs from Hong et al. (2018) and Marton and Sayeed (2022) in that while they focused more on state-of-the-art performance improvement through new modeling and annotation methods, we aim to understand what controls the learning in such networks.
Previous work suggests a difference between "count" and "predict" models, where "count" models represent lexical semantics in terms of raw or adjusted unsupervised frequencies of correlations between words (such as Local Mutual Information; Baroni and Lenci, 2010) and syntactic or semantic phenomena, while "predict" models involve supervised training to achieve their representations, e.g., neural models. Baroni et al. (2014) did a systematic exploration of tasks vs. state-of-the-art count and predict models and found that predict models were overall superior; for thematic fit, predict models were the same as or better than count models in the best unsupervised setup for the task, although they were easily beaten by third-party baselines based on supervised learning over count models. More recently, Lenci et al. (2022) demonstrated that predict models are not reliably superior to count models; rather, it depends on the task and the way the models are trained. They also show that even recent contextual models such as BERT are not necessarily better for out-of-context tasks than well-tuned static representations, predict or otherwise. See the Appendix for details on why we do not use BERT here.

Datasets
We use the Rollenwechsel-English, Version 2 (RW-Eng v2) corpus (Marton and Sayeed, 2022) as the training set for all our experiments. This corpus is sentence-segmented, annotated with morphological analyses, syntactic parses, and syntax-independent PropBank-based semantic role labeling (SRL). The syntactic head word of each semantic argument is determined by using several heuristics to match the parses to the semantic argument spans. Note that a sentence may have multiple predicates (typically verbs) and therefore multiple semantic frames (sometimes called "events"), each with its own semantic arguments, whose span may overlap the argument span of other frames in the sentence.
The first version of this corpus contained NLTK lemmas, MaltParser parses, parts-of-speech (POS) tags, and SENNA SRL tags (Bird, 2006; Nivre et al., 2006; Collobert and Weston, 2007). The second version added layers from more modern taggers: Morfette lemmas, spaCy syntactic parses and POS tags, and LSGN SRL tags (Chrupala, 2011; Honnibal and Johnson, 2015; He et al., 2018). In our experiments here we use the lemmas of the semantic arguments' head words in v2.
The sentences themselves are taken from both ukWaC (Ferraresi et al., 2008) and the British National Corpus (BNC). This corpus contains 78M sentences across 2.3M documents, including 210M verbal predicates with 700M associated role-fillers. We use the same training, validation, and test split as Hong et al. (2018). That is, we have 99.2% (≈201.5M samples) in the full training set, 0.4% in validation, and 0.4% in testing. We run our training experiments on different subsets of the training data, ranging from 1% up to the full dataset. We cap our vocabulary size at the 50,000 most common words in that specific subset.
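The vocabulary cap can be sketched as a frequency cutoff over the sampled training subset. The function names and the reserved id 0 for out-of-vocabulary words are our own illustrative choices, not the codebase's actual conventions:

```python
from collections import Counter

def build_vocab(tokens, cap=50_000):
    # Keep only the `cap` most frequent word types in the training subset;
    # ids start at 1 so that 0 can serve as the out-of-vocabulary id.
    counts = Counter(tokens)
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common(cap))}

def encode(word, vocab):
    # Any word outside the capped vocabulary maps to the OOV id 0.
    return vocab.get(word, 0)
```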
We used the following psycholinguistic test sets:
Padó (Padó et al., 2006): 414 verb-argument pairs and the associated judgement scores. These were constructed from 18 verbs that are present in both FrameNet and PropBank. For each verb, the three most frequent subjects and objects from each of the underlying corpora were selected. That yielded six arguments per verb per corpus, with some overlap between corpora. For each verb-argument pair, a judgement was collected online, with an average of 21 ratings per item for the argument in subject and object role. The rating was collected on a Likert scale of 1-7 with the question "How common is it for [subject] to [verb]?" or "How common is it for [object] to be [verbed]?"
McRae (McRae et al., 1998): 1444 verb-argument pairs in a similar format to Padó. These were created using a similar rating question as the Padó dataset, but the set is a compilation of ratings collected over several studies with considerable overlap and heterogeneous selection criteria.

Modeling and Methodology
In this setup, an input event is represented as role-word pairs, where the role is one of the following PropBank (Palmer et al., 2005) roles: Arg0, Arg1, ArgM-Mnr, ArgM-Loc, ArgM-Tmp, and the predicate. The word is the argument's syntactic head's lemma. Both the role and the head word are taken from RW-Eng v2. All prior works with the ResRoFA-MT model use two random word embedding sets (one for input words and one for the target word) and similarly two role embedding sets. See Figure 1a.
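Concretely, an input event over this role inventory might look as follows. The example sentence and its role assignments are our own illustration of the format, not taken from the corpus:

```python
# "Yesterday I cut the cake in the kitchen", reduced to role -> head lemma.
event = {
    "PRD": "cut",             # predicate lemma
    "Arg0": "i",              # agent-like argument head
    "Arg1": "cake",           # patient/theme-like argument head
    "ArgM-Tmp": "yesterday",  # temporal modifier head
    "ArgM-Loc": "kitchen",    # locative modifier head
    "ArgM-Mnr": None,         # unfilled slot: missing, not out-of-vocabulary
}

# Only the filled role-word pairs feed the model; missing slots are
# represented separately from out-of-vocabulary words.
filled = {role: word for role, word in event.items() if word is not None}
```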
Our implementation differs in these key aspects:
• Modified model architecture: using a single word embeddings set, shared between the target and input words, and similarly a single role embeddings set (Figure 1b). In our experiments, we find the non-shared, redundant embedding layers do not affect performance, while adding 15,000,000 learnable parameters (vocabulary size 50,000 × word embedding size 300) to the model.
• Changes in batching: with previous implementations, one "epoch" only traversed about a third of the data; the next epoch would start on the second third, and so on. We now set the data preprocessing so that one epoch equals one pass through all the training data. Additionally, the data is preprocessed during the training of each batch, so no time is lost during training waiting for the next batch of data to be preprocessed.
• Missing and unknown word handling: following Marton and Sayeed (2022) but unlike Hong et al. (2018), we represent out-of-vocabulary (OOV) words separately from missing words (empty slots in an event).
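The parameter saving from the shared embedding layer is simple arithmetic; the following sketch (with our own variable names) just makes it explicit:

```python
import numpy as np

VOCAB, DIM = 50_000, 300
rng = np.random.default_rng(0)

# Non-shared: separate lookup tables for input words and the target word.
input_emb = rng.normal(size=(VOCAB, DIM))
target_emb = rng.normal(size=(VOCAB, DIM))
separate_params = input_emb.size + target_emb.size

# Shared: one table indexed by both input and target word ids.
shared_emb = rng.normal(size=(VOCAB, DIM))
shared_params = shared_emb.size

# 50,000 x 300 = 15,000,000 learnable parameters saved per embedding type.
saved = separate_params - shared_params
```

The same arithmetic applies to the role embeddings, at a much smaller scale given the tiny role vocabulary.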

Experiments and Discussion
It has been repeatedly observed that in some settings random word embeddings perform as well as pretrained ones, or very nearly so, including in our baselines (Tilk et al., 2016; Hong et al., 2018; Marton and Sayeed, 2022). We design experiments to answer the following questions:
Q1. Why is this so in our compositional semantics and psycholinguistic tasks?
Q2. For such semantic tasks and architecture, where is the learning encoded: in the word embeddings, the role embeddings, or "the network"?
Q3. Training set size effect: is more data better for this indirect setting and these tasks?

Objective and Evaluation
We train a feed-forward network in a multi-task learning setting to optimize word and role prediction accuracy. For target word prediction, we give the prediction layer a context vector formed as a multiplication of the input word-role pairs and the target role. Similarly, for target role prediction we feed the same context vector along with the target word, following the ResRoFA-MT architecture (Hong et al., 2018) (Figure 1a). Since the network initialization is random, we perform 5 runs of each experiment and report the mean with a 95% confidence interval. Following Hong et al. (2018) and Marton and Sayeed (2022), we test each model on the psycholinguistic datasets (Section 3), for which the models were not directly optimized. The idea behind using the latter test battery is that the model, even though trained on (simplified) SRL and word prediction (aka role-filling) tasks, is expected to be able to make indirect generalizations about predicate-argument fit level from the training data and the related objectives. These psycholinguistic tasks are evaluated with Spearman's rank correlation between the sorted human scores and the sorted model scores, except for Bicknell, for which we take the accuracy of predicting which argument in each Patient role-filler pair is (more) congruent (Lenci, 2011).
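The two evaluation regimes can be sketched directly. These are textbook formulations (the no-ties Spearman formula and pairwise accuracy) with function names of our own choosing; real data with tied ratings needs a tie-corrected version such as `scipy.stats.spearmanr`:

```python
def spearman_rho(xs, ys):
    # No-ties Spearman rank correlation: 1 - 6*sum(d^2) / (n*(n^2 - 1)).
    # Used to compare model scores against averaged human judgements.
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def pairwise_accuracy(score_pairs):
    # Bicknell-style scoring: a (congruent, incongruent) pair counts as
    # correct when the model scores the congruent filler higher.
    correct = sum(1 for cong, incong in score_pairs if cong > incong)
    return correct / len(score_pairs)
```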

Shared Embedding Layer
We modify the network to use a single embeddings set shared between the input words and target word, by using a single index-to-embedding mapping layer -and similarly a shared embedding-mapping layer for the input roles and target role (Figure 1b). This change results in 2x the training speed (Section 4) without degradation in performance (see first two rows in Tables 1 and 2). Therefore we use the faster shared architecture for the rest of the experiments. We train all models (until Section 5.6) on a uniformly sampled 1% subset, which is large enough to get indicative results while saving time and cost in experimentation. For comparison of our results to previous work, see Section 5.6.

Random vs. Pre-trained Embeddings
Hong et al. (2018) used random Glorot uniform initialization for the word embeddings. Private communication with the authors confirmed that random embeddings do as well as pretrained ones for these tasks. We replicate this finding, comparing random word embeddings to pretrained GloVe embeddings (Pennington et al., 2014), both of size 300. See the third row in the top part of Tables 1 and 2.
(Q1) Why is this so? We note that during training, embeddings get updated. To check if this update is responsible for bridging the gap between zero knowledge (random embeddings) and much knowledge (compressed in the pre-trained GloVe), we freeze the word embedding layer and rerun the experiments (see the middle part in the same two tables). Contrary to our previous experiment, we find fixed GloVe embeddings do much better than fixed random embeddings on all our tasks. We also see tuning helps the network converge much faster (from 25 epochs down to 11-15).
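Freezing the embedding layer amounts to excluding it from the parameter update. A minimal sketch, with hypothetical names and plain SGD rather than the optimizer actually used:

```python
import numpy as np

def sgd_step(params, grads, lr=0.1, frozen=()):
    # One gradient step that leaves layers named in `frozen` untouched,
    # mimicking a frozen (non-tuned) word embedding layer.
    return {name: (w if name in frozen else w - lr * grads[name])
            for name, w in params.items()}
```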
We conclude that indeed much of the learning is captured in the word embeddings. Tuning them even on only 1% of our training data bridges the knowledge gap from the pre-trained embeddings almost completely (with possible exceptions on Ferretti and Bicknell). But we note that although lower, the fixed embeddings results are not near-random. This leads us to Q2.: Where else is learning done, and to what extent?

Role Contribution
We now turn to role ablation tests. First, we take away the input roles from the context embeddings and call this the no-input-roles network NIR (see Tables 1 and 2). We do not see significant drops in most of the tasks except role prediction, which we expect by construction. Note that when predicting the target word, the NIR network still receives the target role information, which, together with at least the predicate, is likely often sufficient for prediction.
We find it surprising that input role ablation barely affects performance on the psycholinguistic tasks. Why is that? One possibility is that the input role contribution is negligible. But another possibility is that in NIR, all (or almost all) the role information had to be 'crammed' into the target role embeddings.
(Table 2: Thematic Fit tests on 1% training data; same models as in Table 1.)
To tease these apart, we next take away the target role from the penultimate layer of the network, but leave the input roles intact. We call this the no-target-role network NTR (see Figure 1d and the row after NIR in the same tables). Now the role accuracy goes back to the base level (as expected by construction), but word accuracy, as well as performance on the psycholinguistic tasks, drops. We conclude that the target role carries more crucial information than the input roles for our psycholinguistic tasks, and that role information 'cramming', if it happens in NIR, does not happen in the other direction (NTR). Finally, for completeness, we remove all role information from the network. We call this the no-role network NR (see Figure 1e and the same tables). This results in a drastic drop in word accuracy as well as in the psycholinguistic tasks. This is an interesting finding, which supports previous knowledge about the importance of roles in a multi-task learning setting, while at the same time calling into question the importance of roles in the context vector (the output of the residual block in Figure 1). Next, we turn to the impact of this vector and the block that produces it.

"It's the Network!"... Or is it?
In order to see how much the particular ResRoFA-MT model architecture (aka "the network") contributes to our tasks, we first use the finetuned GloVe embeddings from a previously trained base model (third row in Table 1) and assign the rest of the network random weights ("RAND Network" in Tables 1 and 2). To ensure the random weights are similar in size to the trained weights, we calculate the mean and standard deviation for each layer separately and assign that layer random weights using a Gaussian distribution with the same parameters. We see this new model does very poorly, near random prediction. This could be due to the learned representation in the network weights that were ablated here, but also due to incompatibility of the non-trained random network weights with the very informative word embeddings. Therefore, next we replace the complex middle "residual block" with a plain dense projection layer, but let this "Simpler Network" (Figure 1f; Tables 1 and 2) learn during training. In training here we use the fine-tuned word (and role) embeddings from our base model. Curiously, we see a large jump in role accuracy, but a drop in word accuracy as well as in psycholinguistic tasks other than Bicknell's. We can only speculate as to why the latter task is an outlier here. It involves comparing the plausibility of two two-participant events with one participant changed. A simpler network may have an easier time representing binary distinctions within a pair of simple events, as opposed to predicting fine-grained scores of more complex inter-relationships, evaluated by the use of Spearman's ρ in the other datasets. It may even be able to rely on general collocation statistics here, regardless of roles, but we leave this for future work. Note that here, we still do multi-task prediction as before, but in a much simpler network.
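The layer-matched random reinitialization can be sketched as follows (the function name is our own; the idea is a per-layer Gaussian with the trained layer's mean and standard deviation):

```python
import numpy as np

def randomize_like(trained_layers, seed=0):
    # Replace each trained weight matrix with Gaussian noise whose mean and
    # standard deviation match that layer's trained weights, so the random
    # network's weight magnitudes stay comparable per layer.
    rng = np.random.default_rng(seed)
    return {name: rng.normal(loc=w.mean(), scale=w.std(), size=w.shape)
            for name, w in trained_layers.items()}
```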
This, along with the role ablation experiments, suggests that while the potential incompatibility of the non-trained random network weights with the word embeddings may account for some of the drop in performance, the context vector formation through multiplication, and likely also the improvements implemented in our base model, have a large impact on the representation learning as tested on the thematic fit tasks (although not the same impact on role/word prediction).
We see again that there is no clear correlation between the increase in the directly optimized role/word prediction accuracy and the performance on the psycholinguistic tasks for which the models were not directly optimized.
To recap, it seems that the answer to Q2 is nuanced: Padó and McRae are most sensitive to ablated roles; GDS, and perhaps also Bicknell, to non-tuned random word embeddings; Ferretti to ablated (simplified) networks; and all are sensitive to RAND Networks, but Bicknell is surprisingly robust even there.

Training Data Size Effect
Often in machine learning and NLP, models learn better with more data. However, there are typically diminishing returns. To test the effect of training data size, we use our shared layer network with tuned GloVe embeddings (as in row 3 in Table 1) on uniformly sampled 1%, 10%, 20%, 40%, and 100% of the training dataset. See Table 3 and Table 4.
First, in order to compare fairly with previous work, we report the average of the maximum value in each training trial on 20% of the data. (Recall that our 20% of the data is a larger training set than our baselines' 20%, due to improvements in our batcher.) Our role accuracy is better than Hong et al. (2018) and similar to Marton and Sayeed (2022). Our word accuracy is a bit lower than the latter. On the indirectly supervised thematic fit tasks, our results are better on Padó, similar on McRae, but lower for the rest. We suspect that in the previous work, the authors reported the best score over all epochs from all trials, which can explain why the previously reported scores are higher than our results; but we could not verify that. In order to better understand the effect of training set size (Q3), we next use what we believe to be more realistic numbers: the average of the last saved model in each run (best model per our validation set) in each training subset size.
We see incremental improvements from 0.1% to 1% to the 10% dataset across all our evaluation tasks; however, contrary to our null hypothesis, we see diminishing returns or no gains in role and word prediction when using 20% or more of the training set. In most of the psycholinguistic tasks (Table 4), results plateau at 10% or 20%, with the notable exception of Padó and McRae, where we see a negative trend beyond 20%. Why is this so, and only for these two tasks, mainly Padó? The Padó dataset is constructed from high-frequency fillers. It behaves differently from the other datasets and gets a high maximum average score on the 20% subset, probably because there is more training data available for high-frequency fillers, compared to the other datasets, including McRae. Considering the small samples in these test sets, they might quickly become victims not only of high variance, but also of overfitting; that is to say, the models may specialize on the corpus distribution, increasingly so with training set size. This distribution is likely different from the WSJ distribution, from which the Padó dataset is drawn (but see also Section 5.7).
(Table 4: Thematic Fit with GloVe tuned; same models as in Table 3. † One trial had an outlier score of .2026. ‡ All experiments had 5 runs per training subset, except for 100%, with only 2 runs due to compute resource limitations.)
How do word/role prediction and thematic fit tasks relate to each other? We leave this question for future research, but our hypothesis is that psycholinguistic meaning of natural language is grounded in interaction with other modalities (e.g., actions, vision, audio), which a model cannot learn just from more textual training data.
This potentially leads to a much bigger question: how much can a neural model learn natural language just by being trained on ever larger corpora, with billions of parameters, and where is the saturation point? Furthermore, we see that role information is important for our psycholinguistic tasks; how much do the role definition and granularity (e.g., PropBank or FrameNet), or the role set size, matter for these tasks? Possibly, with a richer roleset, we may see more alignment between word/role prediction and the psycholinguistic tasks. Perhaps PropBank roles are too coarse-grained to allow for an analysis of how a role-prediction task relates to a thematic fit task, which involves the fine-grained ranking (via Spearman's ρ) of event plausibilities derived from the underlying semantic characteristics of the nouns and verbs involved. If so, understanding how performance on a role-prediction task relates to thematic fit judgements may not be possible without a finer-grained inventory of semantic characteristics, such as Dowtyan proto-roles (Dowty, 1991).

Global and Local Correlation
We evaluate both Padó and McRae by computing Spearman's rank correlation between the sorted list of the model's probability scores and the sorted list of averaged human scores, for each dataset. Why do Padó and McRae deteriorate with increasing training data size? To test whether this is due to fluctuation of model scores for unrelated but near-in-score verb-noun pairs, we averaged correlations over "local" subsets, grouped by verb. This should be an easier task, since some of the globally close competition is not present in each by-verb subset. Indeed, we see high jumps of 5-8% for the "local" correlation scores in the larger subsets (40% and 100%). But in the smaller subsets we see changes of 2-3% up or down. Moreover, the trend of lower correlation with larger training sets remains. We leave it to future work to dig further into why Padó and McRae show such an anomaly.
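The "local" evaluation can be sketched as grouping test items by verb and averaging a correlation function over the per-verb subsets. The names and the minimum-size filter below are our own illustrative choices:

```python
from collections import defaultdict

def local_correlations(items, corr, min_items=3):
    # items: (verb, human_score, model_score) triples.
    # Returns the mean of `corr` computed within each by-verb subset,
    # skipping verbs with too few items for a meaningful rank correlation.
    by_verb = defaultdict(list)
    for verb, human, model in items:
        by_verb[verb].append((human, model))
    scores = []
    for pairs in by_verb.values():
        if len(pairs) < min_items:
            continue
        humans, models = zip(*pairs)
        scores.append(corr(list(humans), list(models)))
    return sum(scores) / len(scores)
```

Any rank-correlation function can be plugged in as `corr`, e.g. a Spearman implementation.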

Conclusions and Future Work
In this work, we explored why random word embeddings counter-intuitively perform as well as pretrained word embeddings on certain compositional semantic tasks (some being outside the models' explicit objective), where the learning is actually stored (teasing apart the word embeddings, the role embeddings, and the rest of the network), and how training set size affects performance on these tasks. We found that tuning (or further tuning) the word embeddings helps and can bridge the gap between random and pretrained embeddings. Moreover, our tuned embedding space is different from pretrained embeddings like GloVe. We saw that the target role is more important than the input roles for our tasks. Furthermore, our experiments suggested that much of the learning also happens in the rest of the network, outside the word and role embedding layers. No single factor (word/role embeddings or the network) is most important for all tasks.
Training set size had a surprising negative effect on Padó and McRae beyond 20% of the training data. We attempted explaining this with an alternative evaluation method, but this remains to be explained further.
We release our code, including our preferred network architecture -a modified version of ResRoFA-MT with shared embedding layers.
One avenue in which we want to invest is to better understand the complex relationship between word/role accuracy and our psycholinguistic tasks. While our initial hypothesis was that training the network to minimize loss on word/role prediction would also optimize performance on all our tasks, this did not always hold. We suspect that the groundedness is the missing link for (artificially and naturally) learning psycholinguistic tasks, and therefore adding grounding seems promising to us.
Another future avenue for us is investigating the high variability in performance on psycholinguistic tasks, compared to the fairly stable results on the directly optimized-for word and role prediction tasks.

Limitations
There are certain limitations that were unavoidable in this work. One of them is the limited size of the available training and evaluation datasets for testing thematic fit tasks. It is likely that the high variance we observed is due to both our indirect supervision approach (in part due to lack of directly relevant data for training), and the small-size test sets. We are limited here by the state of the art in such datasets, not just by their size. It is a complex task to create and evaluate thematic fit with full phrases and sentences, i.e., not just with the arguments' syntactic heads. Since we do not know of any such datasets, our model was designed with only syntactic heads in mind.
Another limitation is the training dataset quality: due to its size, the training data was machineannotated (for syntactic parsing, SRL and lemmas) and therefore unintended noise and bias may have been introduced in the models. In addition, even though our training datasets were collected with the goal of making them domain-general and balanced, it is hard to enforce and verify that in large sizes. We take issues such as toxicity and gender bias seriously, but we think that in our settings, where the model does not generate language and the test sets do not involve gendered examples, the related risks approach zero.
Semantic tasks such as thematic fit would most likely benefit from training on grounded language, e.g., combining text and vision, but working with such datasets is beyond the scope of this work.
Finally, a rather trivial limitation we have is the number of trials per experiment we could run, due to time and computational constraints. We only ran 3-5 trials per experiment, but a larger number of trials may yield more robust results. Despite all these limitations, we believe our work gives a comprehensive analysis of the ResRoFA-MT model and opens up interesting avenues for future research.

Ethical Considerations
Our work uses RW-Eng v2 (Marton and Sayeed, 2022), which in turn uses two corpora: ukWaC and the BNC. Therefore, we have similar ethical concerns as mentioned in that previous work, including the way the BNC data was collected. Those who so wish can easily exclude the BNC data (it comprises only a small part of the whole corpus) and retrain.
The RW-Eng corpus (v1 or v2) could introduce undesired bias in use outside the UK, since the data is sourced entirely from UK web pages and other UK sources from the 20th century. English used outside the UK, and more recent English anywhere, differ from this corpus in their word distributions, and therefore their input may yield sub-optimal or undesired results. Furthermore, models trained on it could encode a Western-centric view of the world.
The silver labels -the automatic parsing and tagging of the corpus -could introduce bias from the parsing/tagging algorithms. These parsers/taggers are also trained models, which could be affected by their data sources. If this is a concern for some users, we encourage them to perform validation of the data and its annotations.
Having said that, we believe that for most if not all conceivable applications, especially as long as one keeps these limitations in mind, our work should not pose any practical risk.