Language Models Understand Us, Poorly

Some claim language models understand us. Others won't hear it. To clarify, I investigate three views of human language understanding: as-mapping, as-reliability and as-representation. I argue that while behavioral reliability is necessary for understanding, internal representations are sufficient; they climb the right hill. I review state-of-the-art language and multi-modal models: they are pragmatically challenged by under-specification of form. I question the Scaling Paradigm: limits on resources may prohibit scaled-up models from approaching understanding. Last, I describe how as-representation advances a science of understanding. We need work which probes model internals, adds more of human language, and measures what models can learn.


Introduction
A theme of EMNLP this year is "unresolved issues in NLP." Hence I consider what it means to understand human language, whether current language models understand and whether future models will.
So what's next? I identify three views on language understanding (§2): understanding-as-mapping, understanding-as-reliability, and understanding-as-representation. Through examples of recent limitations of language models (§4), I argue for understanding-as-representation because it climbs the right hill (§3). In particular, I question the assumption that scaling current models is a computationally feasible route to human-like understanding (§5). Because of the large gap between human and model understanding, I think it is generally a misapplication to say that models "understand" (§6.1). Better applied are examples of promising work on understanding (§6.2).

Views on Understanding
Some argue that there is a strict barrier which separates human from machine understanding (Bender and Koller, 2020; Searle, 1980). Understanding-as-mapping puts understanding in terms of an absolute mapping between form and meaning. Here, meaning comes from what a series of forms describes. Those forms can be composed in a variety of ways to yield different, legible meanings. Often, those with this view imply humans have special access to meaning.
Others argue that we ought to be rid of the distinction between human and machine understanding. They imply models will close the gap soon enough (Manning, 2022; Agüera y Arcas, 2022; Kurzweil, 2005; Turing, 1950). Understanding-as-reliability puts understanding as a question of reliable communication: can one agent expect another agent to respond to stimuli in a certain way? This view assumes that scaling alone will lead to an agent capable of human-like language; system internals don't matter. For example, in the most extreme case we can imagine a very large look-up table with state (cf. Russell and Norvig 2021): a mapping from every input sequence to a sensible output sequence.
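To make the look-up table with state concrete, here is a minimal sketch. The class name, the integer state, and both table entries are hypothetical illustrations, not anything taken from Russell and Norvig. The agent's only "internals" are a table keyed on the current state and the incoming utterance.

```python
from typing import Dict, Tuple

# (state, input utterance) -> (output utterance, next state)
Table = Dict[Tuple[int, str], Tuple[str, int]]


class LookupTableAgent:
    """Replies purely by table look-up; it has no internals beyond a state index."""

    def __init__(self, table: Table, start_state: int = 0) -> None:
        self.table = table
        self.state = start_state

    def respond(self, utterance: str) -> str:
        # A KeyError here marks an input the table does not cover.
        reply, self.state = self.table[(self.state, utterance)]
        return reply


# Hypothetical two-entry table; a full table would cover every input sequence.
TABLE: Table = {
    (0, "I'm unhappy."): ("Why aren't you happy?", 1),
    (1, "I'm unhappy."): ("You already told me. What happened?", 1),
}

agent = LookupTableAgent(TABLE)
first = agent.respond("I'm unhappy.")   # looked up in state 0
second = agent.respond("I'm unhappy.")  # same input, different state
```

The state is what lets the same input receive different replies over a conversation. Under as-reliability, such an agent would count as understanding so long as its table were large enough.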
In this paper, I put understanding in terms of internal, dynamical representation: when prompted with a stimulus, does an agent reproduce an internal representation similar enough to that intended? Call this understanding-as-representation. Many have proposed related theories (Shanahan and Mitchell, 2022; Barsalou, 2008; Hofstadter and Sander, 2013; Jackendoff et al., 2012; Grice, 1989). In this view, if someone unthinkingly blurts out the correct answer to a question, they would not have understood. While a thermostat reproduces a certain representation given a temperature, this representation is not similar to a person's. Some have said that models appear not to understand because their interrogators fail to present stimuli in a model-understandable way (Michael 2020 summarizes). Exactly: I am concerned with human language understanding, not any possible form of understanding.
To advance a science of understanding, I argue that as-reliability is necessary, as-representation is sufficient, and as-mapping is neither.
I reject the premise of as-mapping that the way we use words is separate from our meanings. While current work in NLP poorly approximates shared intentionality, I disagree that this is the only route to meaning. We could imagine a very large lookup table. There is no boundary between what is and what is not a language.
I accept as-reliability in theory. Enough data and parameters should yield a language-performant agent indistinguishably similar to a human tested on byte streams passed along a wire. Similarly, Potts (2022) argues that a self-supervised foundation model could do so. Still, I am skeptical of what I call the Scaling Paradigm, that scale alone is a realistic approach.
I think that hill climbing works but we're climbing the wrong hill.

Climbing the Right Hill
As-representation and as-reliability are compatible: we may care about representation but more easily look for reliability. I argue that input-output behavioral tests are necessary but may not be sufficient to attribute understanding; we may need to look inside. Nonetheless, Alisha, when messaging with Bowen, has no need to look inside Bowen's head to verify that he understood the following exchange:
A: I'm unhappy.
B: Why aren't you happy?
Our human bias is to assume that other agents understand until evidence proves otherwise (Weizenbaum, 1976). This is pragmatic; until recently humans did not encounter non-human agents who could respond somewhat reliably. Humans assume a similarity of representation, that others have the same inductive biases.
We can't make that assumption with our models. We can't assume that a chat-bot has a bias to coo over babies (cf. Hrdy 2009). This is why Turing's (1948) test doesn't work: the smoke and mirror programs which won the Loebner prize unintentionally parody input-output tests (Minsky, 1995). Reliability, while useful, alone does not advance a science of understanding. As-reliability does not tell us which biases induce understanding. It is not causal.
Granted, humans' internal representations are difficult to measure, may change at each point of access, and in AI we've historically leaned too heavily on certain putative representations. Sutton (2019) calls this a "bitter lesson." So why talk of representation? I agree with the "bitter lesson" but I also know that there is no such thing as a free lunch; human language occupies a small manifold in the space of possible functions. I don't argue to replicate natural functions but rather to be honest about human strengths lest we wander off into fruitless regions of state space. To do logic, at some internal level a system is going to have to appear to use the parts of logic.
Advancing as-representation does not mean we know what representations underlie human language nor that we must use certain ones.
Advancing as-representation does mean that we pay attention to the constraints on human language usage (§4). We should use those to guide our benchmark tests for reliability. We should not get lost in our proxies, especially what the Scaling Paradigm assumes (§5).

Under-specification of Meaning
Language is dynamic (e.g. has a history), intersubjective (multi-agent), grounded in a large number of modalities (senses), collectively intentional (in a cultural context), and more. Present models have little, if any, data on these aspects of language.
As Bisk et al. (2020) make clear in their "world scopes," the majority of work in NLP attempts to learn language from internet text alone. I agree that models have fewer data of the world than humans (Bender and Koller, 2020). What our models see under-specifies our meanings.
Recent work has looked into such dissimilarities. McCoy et al. (2021) note that while language models use novel constructions, they copy from training data at a high rate. McCoy et al. (2019) and Branco et al. (2021) identify how models use heuristics and short-cuts contrary to the human meaning of prompts. Shaham et al. (2022) show that a current model will not remember a game after a long-enough context window.
Work has just begun to show limitations of multimodal models. Thrush et al. (2022) test for compositionality over images and text: current models perform at chance. Marcus et al. (2022) and Conwell and Ullman (2022) show many similar compositional issues for DALL-E 2 in comparison to humans. Lake and Murphy (2021) note the implausibility of present multi-modal models: e.g. they cannot describe internal desires or change beliefs.
Thus my claim is not that models can't learn meaning.My claim is that for models to approach human meaning they will require data on aspects of language the field has only begun to investigate.

Consider two examples of under-specification:
Under-specification of physics. A model over text and static images would perform poorly on a query such as, "Can you remove this block without causing the tower to fall?" paired with an image where a finger points at a block that could obviously be removed (or obviously not).
Under-specification of time. In Western contexts respondents associate earlier with the left and later with the right. This is not specified in language and is mutable (Casasanto and Bottini 2014, and for the example). Thus I expect models which have no notion of time to perform poorly on tests of temporal bias such as those in Fig. 1.
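One way such a temporal-bias test might be scored is sketched below. The stimuli are those of Fig. 1; everything else is an assumption made for illustration: the `AnswerProbs` interface (a function from a rendered prompt to a probability per answer label), the prompt layout, and the uniform stand-in model. A model with a human-like bias should give the correct answer more probability in the intuitive ordering than in the reversed one, so the difference should be positive.

```python
from typing import Callable, Dict, List, NamedTuple

PROMPT = "Are each of these events temporally earlier (A), present (B), or later (C)?"


class Stimulus(NamedTuple):
    phrase: str         # event description, possibly mirrored
    options: List[str]  # answer labels in the intuitive left-to-right order
    answer: str         # correct label


# Stimuli transcribed from Fig. 1.
STIMULI = [
    Stimulus("one day after", ["A", "B", "C"], "C"),
    Stimulus("retfa yad eno", ["C", "B", "A"], "C"),
    Stimulus("a day before", ["A", "B", "C"], "A"),
    Stimulus("erofeb yad a", ["C", "B", "A"], "A"),
]

# Hypothetical interface: rendered prompt -> probability for each answer label.
AnswerProbs = Callable[[str], Dict[str, float]]


def render(stim: Stimulus, options: List[str]) -> str:
    """Lay out the question, the phrase, and the options in display order."""
    return f"{PROMPT}\n{stim.phrase}\n{' '.join(options)}"


def ordering_effect(stim: Stimulus, answer_probs: AnswerProbs) -> float:
    """P(correct | intuitive order) minus P(correct | reversed order)."""
    intuitive = answer_probs(render(stim, stim.options))[stim.answer]
    flipped = answer_probs(render(stim, stim.options[::-1]))[stim.answer]
    return intuitive - flipped


def mean_ordering_effect(answer_probs: AnswerProbs) -> float:
    """Average ordering effect over all stimuli."""
    return sum(ordering_effect(s, answer_probs) for s in STIMULI) / len(STIMULI)


def uniform_model(prompt: str) -> Dict[str, float]:
    """Stand-in with no temporal bias: every label is equally likely."""
    return {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}


baseline = mean_ordering_effect(uniform_model)  # zero: no ordering preference
```

To probe a real model, `uniform_model` would be replaced by a wrapper that returns the model's normalized probabilities for the three labels. A mean effect near zero would suggest the model lacks the left-to-right temporal bias that human response times reveal.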

Challenges to the Scaling Paradigm
But what of more data? The Scaling Paradigm tells us that scale alone (more parameters, more data, more modalities) will be enough to approximate human language understanding. Exponential increases in parameters or training data have yielded [...]

Prompt: are each of these events temporally earlier (A), present (B), or later (C)?
one day after    A B (C)
retfa yad eno    (C) B A
a day before     (A) B C
erofeb yad a     C B (A)
Figure 1: Stimuli to probe temporal biases. Correct answers are shown in parentheses. In human subjects response times are shorter, once trained, for pairings shown and longer when the order of the answers is reversed. A model with similar temporal bias should assign a higher probability to the correct answer in the intuitive ordering (shown).

Still, text doesn't seem like enough. Merrill et al. (2021a) argue that there aren't enough examples to learn meaning from form in languages of assertions. Even Chowdhery et al. (2022, pg. 48) admit to running out of clean data for exponentially bigger models. Furthermore, by the age of five, the average American child has heard between ten and fifty million words (Sperry et al., 2019). A state-of-the-art model sees from 10k to 100k times more words than a kid. For some (e.g. Linzen 2020), the argument stops there: our language models must not be learning the correct functions because they require more data to generalize.
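As a rough check on that ratio, the arithmetic can be sketched as follows. The child figures come from the text above. The model figure is an assumption: it uses PaLM's reported training-set size of 780 billion tokens and treats one token as one word, which overstates the word count somewhat.

```python
CHILD_WORDS_LOW = 10_000_000     # from the text (Sperry et al., 2019)
CHILD_WORDS_HIGH = 50_000_000    # from the text (Sperry et al., 2019)
MODEL_TOKENS = 780_000_000_000   # assumption: PaLM's reported training tokens

# Smallest gap: the child who heard the most words.
ratio_low = MODEL_TOKENS / CHILD_WORDS_HIGH
# Largest gap: the child who heard the fewest words.
ratio_high = MODEL_TOKENS / CHILD_WORDS_LOW

print(f"model sees {ratio_low:,.0f}x to {ratio_high:,.0f}x more than a child")
```

Under these assumptions the gap falls inside the 10k to 100k range quoted above.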
Nonetheless, humans have plenty of data to compare language use besides the words they hear (Tomasello, 2003; Lakoff and Johnson, 1980). For example, Smith et al. (2018) show that children at first need items to be visually centered to learn them. So the claim that models learn the wrong function may only apply when limited to text.
What of more modalities, then? To extrapolate on the figures of PaLM (Chowdhery et al., 2022, §13), we only need to scale a model to about 2^12 billion parameters and train it for 2.59 × 10^26 flops (and increase training data) for perfect performance on a variety of English NLP tasks. For perspective, Heim (2022) estimates the cost of PaLM at 10 million USD which, at the 100x projected, is 1 billion USD. These figures seem infeasible. Furthermore, the tests in PaLM do not capture the limits I mentioned for human-like understanding (§4); we would need to add image (Ramesh et al., 2022) and game (Reed et al., 2022) networks as well. I don't know exactly what effect these added modalities will have except that they will increase the exponent on scale. I am wary that the long tail of returns embraced by the Scaling Paradigm chases an exponential.8 The primacy of scale implies an exponentially diminishing long tail of capability. How soon until our models become planetary in size, as Bostrom (2014) portends? I thus recast the debate, not as what can theoretically be learned from data (as the Scaling Paradigm trumpets), but as the computational efficiency of different learning approaches. To asymptotically approach an approximation of human language understanding, how many parameters, how much data, and how many modalities will our models need?
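The extrapolation above can be laid out as a short calculation. This is a sketch under stated assumptions: it reads the parameter count as 2^12 billion, assumes cost grows linearly with training compute, and takes PaLM's published size of 540 billion parameters as the baseline, a figure not given in the text.

```python
PALM_COST_USD = 10_000_000        # Heim (2022) estimate, from the text
COMPUTE_MULTIPLIER = 100          # the "100x projected" in the text
PROJECTED_FLOPS = 2.59e26         # from the text
PROJECTED_PARAMS = 2**12 * 10**9  # assumption: the text read as 2^12 billion
PALM_PARAMS = 540 * 10**9         # assumption: PaLM's published size

# Assumes cost scales linearly with training compute.
projected_cost_usd = PALM_COST_USD * COMPUTE_MULTIPLIER
# Baseline compute implied by the 100x multiplier.
implied_baseline_flops = PROJECTED_FLOPS / COMPUTE_MULTIPLIER
# How much larger the projected model is than the baseline.
param_growth = PROJECTED_PARAMS / PALM_PARAMS

print(f"projected cost: {projected_cost_usd:,} USD")
print(f"implied baseline compute: {implied_baseline_flops:.2e} flops")
print(f"parameter growth: {param_growth:.1f}x")
```

On these numbers the parameter count grows by less than 8x while compute grows by 100x, so much of the projected compute would go to training on more data, in line with the parenthetical "(and increase training data)".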
These concerns about scale make me hesitant to suggest that efforts will soon close the gap between human and machine understanding (contra as-reliability) even as I agree that they will narrow it (contra as-mapping).

Sorta Understands != Understands
To admit that models may one day understand like us, and to claim that they now understand in limited domains, need not lead us to give understanding over to machines far to one end of the spectrum. These models understand parts of language, just not in the same way as humans understand each other. They are not biased in the same way humans are. Machines will make unknown mistakes unless we can interrogate their representations. Moreover, attributing "understanding" is consequential. Some argue for granting moral rights to robots (Gordon and Gunkel, 2021). Page describes those decrying AI as "speciesist" (Tegmark, 2017). Bryson (2010) argues to the contrary. But note that understanding and moral status aren't the same. I urge caution over assuming the distinction between human and machine will disappear anytime soon.
We should not let the theoretical capacities of AI blind us to present realities. Saying that current large language models understand is, as McDermott (1976) described a while back, another case of a "wishful mnemonic."

8 While some note the exponential scales of large models (Thompson et al., 2021; John and Musser, 2022) they may not account for counter-measures (Patterson et al., 2022).

Pragmatic NLP
Instead of denying or abandoning understanding, representation advances a science. It allows us to answer: how does this system understand? How similar are the representations of these two systems? This is what Harnad (1990) and Santoro et al. (2022) call for. Indeed, as-representation describes emerging trends in NLP which . . .
Probe model internals. While most benchmark tasks focus on input-output reliability (Linzen, 2020; Zhang et al., 2021), investigating understanding will require functional analysis. Buckner (2020) calls for us to determine a taxonomy of the non-robust features detected by deep nets. Beckers et al. (2020) show how to compare causal models at multiple levels of granularity. Geiger et al. (2021), Li et al. (2021), and Lovering and Pavlick (2022) extend this analysis with interventions to ask which representations (simpler models) approximate large language models. Olsson et al. (2022) find evidence for so-called induction heads in transformer models. Johnston and Fusi (2022) find abstract "representations" emergent from neural networks trained on similar tasks. Merrill et al. (2021b) investigate norm growth saturation in transformers as their inductive biases.
Add more of human language such as intersubjective, multi-agent environments. Noukhovitch et al. (2021) find that partially-competitive agents learn to use symbols. In a text game, Hendrycks et al. (2021) gauge moral valence. This is as Firestone (2020) describes, to use species-fair human-machine comparisons.
We might focus on human data constraints, such as the CHILDES database of language learning (MacWhinney, 2000; Linzen, 2020). Hill et al. (2020) show how increasing the modalities of data given to a deep network may lead to generalizability. Contra strict composition, Santoro et al. (2022) argue for the inclusion of "socio-cultural interactions." These are similar to calls for dynamic grounding (Chandu et al., 2021) and "common sense" (Sap et al., 2020). For example, we need models which not only resolve gaze (Koleva et al., 2015) but also extend gaze sharing to sharing in other domains.
On generalizability, Mitchell (2019) asks us to consider micro domains: "abc:abd; xxyyzz:?". Şahin et al. (2020) propose a task with a small dataset from Rosetta Stone. Brachman and Levesque (2022) propose using only the kids' version of Wikipedia, KidzSearch, to reduce the number of entities on which to train. A language-performant agent should be able to do well in this micro domain alone.
Measure what models can learn. We need more work on the tractability of meaning along the lines of Merrill et al. (2021a) on a language of assertions. How many different streams of data (or "world scopes"; Bisk et al. 2020) must we add to models to make them more reliable? What kind of scaling factors can we reasonably expect as we include more aspects of human language?

Conclusion
Present language models have limited access to meaning (§4) and scale alone may not be sufficient to achieve human-level understanding (§5), at least until we can guarantee similar representations or inductive biases (§2).
Inference tasks reduce the space of possible entities implied by conventional usage. Still, we must be assured of what they represent in order to guarantee their reliability. Understanding-as-mapping would deny the distributional compositionality of our languages and of our minds. Understanding-as-reliability would claim our machines "get" meaning so long as we focus only on temporally limited usage. Understanding-as-representation would focus on accurate, measurable, and tractable meaning.

Limitations
This is only one perspective of many possible on the nature of understanding. Given the short form of this presentation I am not able to do justice to the diverse fields, from philosophy of science to linguistics to brain science, which I reference. I may therefore insufficiently explain the relevant literature to my target readers in NLP. While I have attempted to present a well-informed prior on the future direction of AI, my perspective is nonetheless uncertain; I may ultimately be incorrect.