A Usage-Based Model of Early Grammatical Development

The representations and processes yielding the limited length and telegraphic style of language production early on in acquisition have received little attention in acquisitional modeling. In this paper, we present a model, starting with minimal linguistic representations, that incrementally builds up an inventory of increasingly long and abstract grammatical representations (form+meaning pairings), in line with the usage-based conception of language acquisition. We explore its performance on a comprehension and a generation task, showing that, over time, the model better understands the processed utterances, generates longer utterances, and better expresses the situation these utterances intend to refer to.


Introduction
A striking aspect of language acquisition is the difference between children's and adults' utterances. Simulating early grammatical production requires a specification of the nature of the linguistic representations underlying the short, telegraphic utterances of children. In the usage-based view, young children's grammatical representations are thought to be less abstract than adults', e.g. by having stricter constraints on what can be combined with them (cf. Akhtar and Tomasello 1997; Bannard et al. 2009; Ambridge et al. 2012). The representations and processes yielding the restricted length of these early utterances, however, have received little attention. Following Braine (1976), we adopt the working hypothesis that the early learner's grammatical representations are more limited in length (or: arity) than those of adults.
Similarly, in computational modeling of grammar acquisition, comprehension has received more attention than language generation. In this paper we attempt to make the mechanisms underlying early production explicit within a model that can parse and generate utterances, and that incrementally learns constructions (Goldberg, 1995) on the basis of its previous parses. The model's search through the hypothesis space of possible grammatical patterns is highly restricted. Starting from initially small and concrete representations, it learns increasingly long representations (syntagmatic growth) as well as more abstract ones (paradigmatic growth). Several models address either paradigmatic (Alishahi and Stevenson, 2008; Chang, 2008; Bannard et al., 2009) or syntagmatic (Freudenthal et al., 2010) growth. This model aims to explain both, thereby contributing to the understanding of how different learning mechanisms interact. As opposed to other models involving grammars with semantic representations (Alishahi and Stevenson, 2008; Chang, 2008), but similar to Kwiatkowski et al. (2012), the model starts without an inventory of mappings of single words to meanings.
Based on motivation from usage-based and construction grammar approaches, we define several learning principles that allow the model to build up an inventory of linguistic representations. The model incrementally processes pairs of an utterance U, consisting of a string of words w_1 ... w_n, and a set of situations S, one of which is the situation the speaker intends to refer to. The other situations contribute to propositional uncertainty (the uncertainty over which proposition the speaker is trying to express; Siskind 1996). The model tries to identify the intended situation and to understand how parts of the utterance refer to certain parts of that situation. To do so, the model uses its growing inventory of linguistic representations (Section 2) to analyze U, producing a set of structured semantic analyses or parses (Fig. 1, arrow 1; Section 3).
The resulting best parse, U, and the selected situation are then stored in a memory buffer (arrow 2), which is used to learn new constructions (arrow 3) using several learning mechanisms (Section 4). The learned constructions can then be used to generate utterances as well. We describe two experiments: in the comprehension experiment (Section 5), we evaluate the model's ability to parse the stream of input items. In the generation experiment (Section 6), the model generates utterances on the basis of a given situation and its linguistic knowledge. We evaluate the generated utterances given different amounts of training items to consider the development of the model over time.

Representations
We represent linguistic knowledge as constructions: pairings of a signifying form and a signified (possibly incomplete) semantic representation (Goldberg, 1995). The meaning is represented as a graph with the nodes denoting entities, events, and their relations, connected by directed unlabeled edges. The conceptual content of each node is given by a set of semantic features. We assume that meaning representations are rooted trees. The signifying form consists of a positive number of constituents. Every constituent has two elements: a phonological form, and a pointer to a node in the signified meaning (in line with Verhagen 2009). Both can be specified, or one can be left empty. Constituents with unspecified phonological forms are called open and are marked as such in the figures. The head constituent of a construction is defined as the constituent that has a pointer to the root node of the signified meaning. We furthermore require that no two constituents point to the same node of the signified meaning.
This definition generalizes over lexical elements (one phonologically specified constituent) as well as larger linguistic patterns. Fig. 2, for instance, shows two larger constructions being combined with each other. We call the set of constructions the learner has at some moment in time the constructicon C (cf. Goldberg 2003).
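The definition above can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the class names `Constituent` and `Construction`, the integer node ids, and the feature labels in the example are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Constituent:
    phon: Optional[str]     # phonological form; None marks an open constituent
    pointer: Optional[int]  # id of a node in the meaning; None if left empty


@dataclass
class Construction:
    constituents: List[Constituent]   # ordered form side (at least one)
    features: Dict[int, Set[str]]     # meaning nodes: node id -> semantic features
    edges: Set[Tuple[int, int]]       # directed, unlabeled (parent, child) edges
    root: int                         # root node of the meaning tree
    count: int = 0

    def head(self) -> Optional[Constituent]:
        """The constituent pointing to the root node of the meaning."""
        for constituent in self.constituents:
            if constituent.pointer == self.root:
                return constituent
        return None

    def is_well_formed(self) -> bool:
        """At least one constituent, and no two constituents share a node."""
        pointers = [c.pointer for c in self.constituents if c.pointer is not None]
        return bool(self.constituents) and len(pointers) == len(set(pointers))


# Hypothetical pattern with two open argument slots around the word "put".
put_island = Construction(
    constituents=[Constituent(None, 1), Constituent("put", 0), Constituent(None, 2)],
    features={0: {"event", "cause", "move"}, 1: {"agent"}, 2: {"theme"}},
    edges={(0, 1), (0, 2)},
    root=0,
)
```

In this sketch a lexical element is simply a `Construction` with one phonologically specified constituent, so words and larger patterns share one representation.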

Parsing operations
We first define a derivation d as an assembly of constructions in C, using four parsing operations defined below. In parsing, derivations are constrained by the utterance U and the situations S, whereas in production, only a situation s constrains the derivation. The leaf nodes of a derivation must consist of phonological constraints of constructions that (in parsing) are satisfied by U. All constructions used in a derivation must map to the same situation s ∈ S. A construction c maps to s iff the meaning of c constitutes a subgraph of s, with the features on each of the nodes in the meaning of c being a subset of the features on the corresponding node of s. Moreover, each construction must map to a different part of s. This constitutes a mutual exclusivity effect in analyzing U: every part of the analysis must contribute to the composite meaning. A derivation d thus gives us a mapping between the composed meaning of all constructions used in d and one situation s ∈ S. The aggregate mapping specifies a subgraph of s that constitutes the interpretation of that derivation.
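The "maps to" condition can be made concrete with a brute-force check. This is a sketch under stated assumptions: meanings and situations are given as a feature dictionary plus an edge set, the function name `maps_to` is hypothetical, and the exhaustive search over node alignments is only practical for the small graphs used here.

```python
from itertools import permutations


def maps_to(c_feats, c_edges, s_feats, s_edges):
    """Return a node alignment if the meaning is a subgraph of the situation, else None.

    c_feats / s_feats: node id -> set of semantic features
    c_edges / s_edges: sets of directed (parent, child) edges
    """
    c_nodes = list(c_feats)
    # Try every injective assignment of meaning nodes to situation nodes.
    for image in permutations(s_feats, len(c_nodes)):
        align = dict(zip(c_nodes, image))
        feats_ok = all(c_feats[n] <= s_feats[align[n]] for n in c_nodes)
        edges_ok = all((align[a], align[b]) in s_edges for a, b in c_edges)
        if feats_ok and edges_ok:
            return align
    return None


# Hypothetical situation: a putting event with an agent and a theme.
situation_feats = {
    "e": {"event", "put", "cause"},
    "a": {"entity", "female", "agent"},
    "t": {"entity", "plural", "theme"},
}
situation_edges = {("e", "a"), ("e", "t")}

alignment = maps_to(
    {0: {"event", "put"}, 1: {"entity", "agent"}}, {(0, 1)},
    situation_feats, situation_edges,
)
```

The returned alignment is the mapping that a derivation would aggregate over all of its constructions to obtain its interpretation.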
The central parsing operation is the COMBINATION operator •. In c_i • c_j, the leftmost open constituent of c_i is combined with c_j. Fig. 2 illustrates COMBINATION. COMBINATION succeeds if both the semantic pointer of the leftmost open constituent of c_i and the semantic pointer of the head constituent of c_j map to the same semantic node of a situation s.
Initially, the model has few constructions to analyze the utterance with. Therefore, we define three other operations that allow the model to create a derivation over the full utterance without combining constructions. First, a known or unknown word that cannot be fit into a derivation can be IGNOREd. Second, an unknown word can be used to fill an open constituent slot of a construction with the BOOTSTRAP operator. Bootstrapping entails that the unknown word will be associated with the semantics of the node. Finally, the learner can CONCATENATE multiple derivations, by linearly sequencing them, thus creating a more complex derivation without combining constructions. This allows the learner to interpret a larger part of the situation than with COMBINATION only. The resulting sequences may be analyzed in the learning process as constituting one larger construction, consisting of the parts of the concatenated derivations. Fig. 3 illustrates these three operations.

Selecting the best analysis
Multiple derivations can be highly similar in the way they map parts of U to parts of an s ∈ S. We define a parse to be a set of derivations that have the same internal structure and the same mappings to a situation, but that use different constructions in doing so (cf. multiple licensing; Kay 2002). We take the most probable parse of U to be the best analysis of U. The most probable parse points to a situation, which the model then assumes to be the identified situation or s_identified. If no parse can be made, s_identified is selected at random from S.
The probability of a parse p is given by the sum of the probabilities of the derivations d subsumed under that parse, which in turn are defined as the product of the probabilities of the constructions c used in d:
$$P(p) = \sum_{d \in p} \prod_{c \in d} P(c) \qquad (1)$$
The probability of a construction P(c) is given by its relative frequency (count) in the constructicon C, smoothed with Laplace smoothing. We assume that the simple parsing operations of IGNORE, BOOTSTRAP, and CONCATENATION reflect usages of an unseen construction with a count of 0.
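The probability model described above can be sketched in a few lines. The text does not spell out the exact form of the Laplace smoothing, so the denominator used here (total count plus the number of construction types plus one slot for unseen constructions) is an assumption, as are the function names. Exact fractions are used so that the arithmetic is easy to check by hand.

```python
from fractions import Fraction
from math import prod


def construction_prob(count, total_count, n_types):
    # Add-one smoothing. The extra "+ 1" in the denominator reserves mass for
    # unseen constructions (an assumption, not a detail given in the text).
    return Fraction(count + 1, total_count + n_types + 1)


def derivation_prob(counts, total_count, n_types):
    """Product of the probabilities of the constructions used in a derivation.

    IGNORE, BOOTSTRAP and CONCATENATE are entered as a count of 0.
    """
    return prod(construction_prob(c, total_count, n_types) for c in counts)


def parse_prob(derivations, total_count, n_types):
    """Sum over the derivations subsumed under one parse."""
    return sum(derivation_prob(d, total_count, n_types) for d in derivations)


# Hypothetical constructicon: 3 construction types with counts 2, 5 and 1.
# One parse with two derivations: one uses the count-2 construction, the
# other uses the count-5 construction plus one BOOTSTRAP step (count 0).
p = parse_prob([[2], [5, 0]], total_count=8, n_types=3)
```

With these numbers the first derivation contributes 3/12 and the second (6/12)(1/12), so the parse probability is 7/24.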
The most probable parse, U, and s_identified are added to the memory buffer. The memory buffer has a pre-set maximal length, discarding the oldest exemplars upon reaching this length. In the future, we plan to consider more realistic mechanisms for the memory buffer, such as graceful degradation, and attention effects.
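A fixed-length buffer that drops its oldest exemplar is what Python's `collections.deque` provides through its `maxlen` argument. The sketch below only illustrates that behaviour. The exemplar fields and the helper name `make_buffer` are hypothetical.

```python
from collections import deque


def make_buffer(max_length=5):
    """A memory buffer that silently discards its oldest exemplar when full."""
    return deque(maxlen=max_length)


buffer = make_buffer(5)
for t in range(7):
    # Each exemplar would hold the best parse, the utterance and s_identified.
    buffer.append({"t": t, "utterance": f"utterance {t}"})

remaining = [exemplar["t"] for exemplar in buffer]
```

After seven input items, only the five most recent exemplars are left in the buffer.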

Learning mechanisms
The model uses the best parse of the utterance to develop its knowledge of the constructions in the constructicon C. Two simple operations, UPDATE and ASSOCIATE, are used to create initial constructions and reinforce existing ones respectively. Two additional operations, PARADIGMATIZATION and SYNTAGMATIZATION, are key to the model's ability to extend these initial representations by inducing novel constructions that are richer and more abstract than existing ones.

Direct learning from the best parse
The best parse is used to UPDATE C. For this mechanism, the model uses the concrete meaning of s_identified rather than the (potentially more abstract) meaning of the constructions in the best parse.¹ Every construction in the parse is assigned the subgraph of s_identified it maps to as its new meaning, and the count of the adjusted construction is incremented by 1, or the construction is added to C with a count of 1 if it does not yet exist. This includes applications of the BOOTSTRAP operation, creating a mapping of the previously unknown word to a situational meaning.
ASSOCIATE constitutes a form of simple cross-situational learning over the memory buffer. The intuition is that co-occurring word sequences and meaning components that remain unanalyzed across multiple parses might themselves comprise the form-meaning pairing of a construction. If the unanalyzed parts of two situations contain an overlapping subgraph, and the unanalyzed parts of two utterances an overlapping subsequence of words, the two are mapped to each other and added to C with a count of 0.
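The cross-situational idea behind ASSOCIATE can be sketched as follows. Two simplifications are assumed here and are not taken from the text: the unanalyzed meaning is flattened to a set of features rather than a subgraph, and the "overlapping subsequence" is read as the longest contiguous run of shared words, found with the standard-library `difflib.SequenceMatcher`. The function name `associate` is hypothetical.

```python
from difflib import SequenceMatcher


def associate(words_a, feats_a, words_b, feats_b):
    """Pair the shared unanalyzed words of two exemplars with their shared meaning."""
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    match = matcher.find_longest_match(0, len(words_a), 0, len(words_b))
    shared_words = words_a[match.a:match.a + match.size]
    shared_feats = feats_a & feats_b
    if shared_words and shared_feats:
        # New constructions from ASSOCIATE enter the constructicon with count 0.
        return {"form": shared_words, "meaning": shared_feats, "count": 0}
    return None


# Hypothetical unanalyzed residues of two exemplars in the memory buffer.
new_construction = associate(
    ["the", "ball"], {"entity", "ball", "round"},
    ["a", "ball", "now"], {"entity", "ball"},
)
```

If either the words or the meaning components fail to overlap, nothing is added.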

Qualitative extension of the best parse
Syntagmatization
Some of the processes described thus far yield analyses of the input in which constructions are linearly associated but lack appropriate relational structure among them. The model requires a process, which we call SYNTAGMATIZATION, that enables it to induce further hierarchical structure.
In order for the learner to acquire constructions in which the different constituents point to different parts of the construction's meaning, the ASSOCIATE operation does not suffice. We assume that the learner is able to learn such constructions by using concatenated derivations. The process we propose is SYNTAGMATIZATION. In this process, the various concatenated derivations are taken as constituents of a novel construction. This instantiates the idea that joint processing of two (or more) events gradually leads to a joint representation of these, previously independent, events.
More precisely, the process starts by taking the top nodes T of the derivations in the best parse, where T consists of the single top node if no CONCATENATION has been applied, or the set of concatenated nodes of the parse tree if CONCATENATION has been applied (e.g. for the derivation in Fig. 3, |T| = 2). For each top node t ∈ T, we take the root node of the construction's meaning, and define its semantic frame to consist of all children (roles) and grandchildren (role-fillers) of the node in the situation it maps to. The model then forms a novel construction c_syn by taking all the constructions in the parse whose semantic root nodes point to a node in this semantic frame, referring to those as the set R of semantically related constructions. As the novel meaning of c_syn, the model takes the subgraph of the situation mapped to by the joint mapping of all constructional meanings of constructions in R.
R, as well as all phonologically specified constituents of t itself, are then linearized as the constituents of c_syn. The novel construction thus constitutes a construction with a higher arity, 'joining' several previously independent constructions. Fig. 4 illustrates the syntagmatization mechanism.
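The selection of R and the linearization step can be sketched as below. This is a deliberately reduced illustration: each construction in the parse is represented only by a hypothetical `(position, form, node)` triple, where `node` is the situation node its semantic root maps to, and the construction of the joint meaning subgraph is left out.

```python
def semantic_frame(edges, top_node):
    """Children (roles) and grandchildren (role-fillers) of a node in the situation."""
    children = {child for parent, child in edges if parent == top_node}
    grandchildren = {child for parent, child in edges if parent in children}
    return children | grandchildren


def syntagmatize(top, parsed, edges):
    """Join a top construction with the constructions in its semantic frame.

    top and the items in parsed are (position, form, node) triples.
    Returns the constituents of the novel construction in utterance order.
    """
    frame = semantic_frame(edges, top[2])
    related = [item for item in parsed if item[2] in frame]   # the set R
    ordered = sorted(related + [top])                          # linearize by position
    return [(form, node) for _, form, node in ordered]


# Hypothetical situation: event e with roles r1, r2 and role-fillers f1, f2.
edges = {("e", "r1"), ("r1", "f1"), ("e", "r2"), ("r2", "f2")}
c_syn = syntagmatize(
    top=(1, "put", "e"),
    parsed=[(0, "she", "f1"), (2, "them", "f2"), (3, "now", "x")],
    edges=edges,
)
```

The construction mapped to node `x` lies outside the semantic frame of the event, so it is not joined into the higher-arity construction.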
Figure 4: The SYNTAGMATIZATION mechanism. The mechanism takes a derivation as its input and reinterprets it as a novel, syntagmatized construction of higher arity.
Paradigmatization
Due to our usage-driven approach, all learning mechanisms so far give us maximally concrete constructions. In order for the model to generalize beyond the observed input, some degree of abstraction is needed. The model does so with the PARADIGMATIZATION mechanism. This mechanism recursively looks for minimal abstractions (cf. Tomasello 2003, 123) over the constructions in C and adds those to C, thus creating a full-inheritance network (cf. Langacker 1989, 63-76).
An abstraction over a set of constructions is made if there is an overlapping subgraph between the meanings of the constructions, where every node of the subgraph is the non-empty feature set intersection between two mapped nodes of the constructional meanings. Furthermore, the constituents must be mappable: both constructions have the same number of constituents and the paired constituents point to a mapped node of the meaning. The meaning of the abstracted construction is then set to this overlapping subgraph, which is the lowest possible semantic abstraction over the constructions. The constituents of this new abstraction have a specified phonological form if the more concrete constructions share the same word, and an unspecified one otherwise. The count of an abstracted construction is given by the cardinality of the set of its direct descendants in the network. This generalizes Bybee's (1995) idea about type frequency as a proxy for productivity to a network structure. Fig. 5 illustrates the paradigmatization mechanism.
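A single abstraction step over two constructions can be sketched as follows. The sketch assumes a flattened representation in which each constituent is a hypothetical `(phonological form, feature set)` pair, so the graph structure of the meaning and the recursive search over C are left out. The function name `abstract` and the feature labels are illustrative only.

```python
def abstract(c1, c2):
    """Minimal abstraction over two constructions, or None if they are not mappable.

    Each construction is a list of (phon, features) pairs, one per constituent.
    """
    if len(c1) != len(c2):
        return None                      # constituents must be pairable
    result = []
    for (phon1, feats1), (phon2, feats2) in zip(c1, c2):
        shared = feats1 & feats2
        if not shared:
            return None                  # every mapped node needs a non-empty intersection
        # Keep the word if both constructions share it, else leave the slot open.
        result.append((phon1 if phon1 == phon2 else None, shared))
    return result


# Hypothetical concrete constructions differing only in the verb.
you_put_it = [("you", {"entity", "hearer"}), ("put", {"event", "cause", "move"}),
              ("it", {"entity", "thing"})]
you_take_it = [("you", {"entity", "hearer"}), ("take", {"event", "cause", "possess"}),
               ("it", {"entity", "thing"})]

you_V_it = abstract(you_put_it, you_take_it)
```

Here the abstraction keeps both pronouns and opens the verb slot, with the verb meaning reduced to the features the two verbs share.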

Experimental set-up
The model is incrementally presented with (U, S) pairings based on Alishahi & Stevenson's (2010) generation procedure. In this procedure, an utterance and a semantic frame expressing its meaning (a situation) are generated. The generation procedure follows distributions occurring in a corpus of child-directed speech. As we are interested in the performance of the model under propositional uncertainty, we add a parametrized number of randomly sampled situations, so that S consists of the situation the speaker intends to refer to (s_correct) and a number of situations the speaker does not intend to refer to.² Here, we set the number of additional situations to be 1 or 5; the other parameter of the model, the size of the memory buffer, is set to 5 exemplars.
For the comprehension experiment, we evaluate the model's performance parsing the input items, averaging over every 50 (U, S) pairs. We track the ability to identify the intended situation from S. Identification succeeds if the best parse maps to s_correct, i.e. if s_identified = s_correct. Next, situation coverage expresses what proportion of s_identified has been interpreted and thus how rich the meanings of the used constructions are. It is defined as the number of nodes of the interpretation of the best parse, divided by the number of nodes of s_identified. Finally, utterance coverage tells us what proportion of U has been parsed with constructions (excluding IGNOREd words; including BOOTSTRAPped words). The measure expresses the proportion of the signal that the learner (correctly or incorrectly) is able to interpret.
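The two coverage measures are simple ratios and can be sketched as below. The input encodings are assumptions made for illustration: nodes are given as sets of ids, and each word of U carries a hypothetical label naming the operation that handled it.

```python
def situation_coverage(interpreted_nodes, situation_nodes):
    """Share of the nodes of s_identified covered by the best parse's interpretation."""
    situation = set(situation_nodes)
    return len(set(interpreted_nodes) & situation) / len(situation)


def utterance_coverage(word_ops):
    """Share of the words of U that were not IGNOREd.

    word_ops holds one label per word: "construction", "bootstrap" or "ignore".
    """
    return sum(op != "ignore" for op in word_ops) / len(word_ops)


# Hypothetical best parse: 2 of 4 situation nodes interpreted, 2 of 4 words used.
sit_cov = situation_coverage({"e", "a"}, {"e", "a", "t", "l"})
utt_cov = utterance_coverage(["construction", "ignore", "bootstrap", "ignore"])
```

Averaging these values over windows of 50 input items would give the kind of developmental curve reported in the comprehension experiment.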
For exploring language production, the model receives a situation, and (given the constructicon) finds the most probable, maximally expressive, fully lexicalized derivation expressing it. That is: among all derivations terminating in phonologically specified constituents, it selects the derivations that cover the most semantic nodes of the given situation. In the case of multiple such derivations, it selects the most probable one, following the probability model in Section 3. We only allow for the COMBINATION operator in the derivations, as BOOTSTRAPPING and IGNORE refer to words in a given U, and CONCATENATE is a back-off method for analyzing more of U than the constructicon allows for. The situations used in the generation experiment do not occur in the training items, so that we truly measure the model's ability to generate utterances for novel situations.
² [...] with the intended situation, to reflect more realistic input (cf. Siskind 1996).
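The selection rule for generation (maximal coverage first, probability as a tie-breaker) can be sketched as a lexicographic maximum. The candidate triples and their numbers are hypothetical, and enumerating the candidate derivations themselves is outside the scope of this sketch.

```python
def select_generation(candidates):
    """Pick the derivation covering the most nodes; break ties by probability.

    candidates: (words, covered_nodes, probability) triples for fully
    lexicalized derivations built with COMBINATION only.
    """
    return max(candidates, key=lambda cand: (cand[1], cand[2]))


candidates = [
    (["put", "away"], 4, 0.020),                  # probable but less expressive
    (["she", "put", "them", "away"], 7, 0.001),
    (["she", "put", "them", "away"], 7, 0.004),   # same coverage, more probable
]
best = select_generation(candidates)
```

Coverage takes priority, so the short but highly probable candidate loses to the two candidates that express more of the situation.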
The phonologically specified leaf nodes of the best derivation constitute the generated utterance U_gen. U_gen is evaluated on the basis of its mean length, in number of words, its situation coverage, as defined in the comprehension experiment, and its utterance precision and utterance recall.
To calculate these, we take the maximally overlapping subsequence U_overlap between the actual utterance U_act associated with the situation and U_gen. Utterance precision (how many words are generated correctly) and utterance recall (how many of the correct words are generated) are defined as:
$$\text{precision} = \frac{|U_{overlap}|}{|U_{gen}|}, \qquad \text{recall} = \frac{|U_{overlap}|}{|U_{act}|}$$
Because the (U, S) pairs on which the model was trained are generated randomly, we show results for comprehension and production averaged over 5 simulations.
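The two utterance measures can be sketched under the reading that precision is the length of the overlap divided by the length of U_gen, and recall is the length of the overlap divided by the length of U_act. The overlap is computed here as a longest common subsequence, which is one plausible interpretation of "maximally overlapping subsequence". The function names and the generated utterances in the example are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common (not necessarily contiguous) subsequence."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def utterance_scores(u_gen, u_act):
    """Return (precision, recall) of a generated utterance against the actual one."""
    overlap = lcs_length(u_gen, u_act)
    precision = overlap / len(u_gen) if u_gen else 0.0
    recall = overlap / len(u_act) if u_act else 0.0
    return precision, recall


u_act = "she put them away".split()
omission = utterance_scores("put away".split(), u_act)            # words left out
commission = utterance_scores("she put them in".split(), u_act)   # one wrong word
```

An utterance that only omits words keeps a precision of 1 while its recall drops, which is the pattern the generation experiment tracks over time.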

Experiments
A central motivation for the development of this model is to account for early grammatical production: can we simulate the developmental pattern of the growth of utterance length and a growing potential for generalization? The same constructions underlying these productions should, at the same time, also account for the learner's increasing grasp of the meaning of U. To explore the model's performance in both domains, we present a comprehension and a generation experiment.

Comprehension results
Fig. 6a gives us the results over time of the comprehension measures given a propositional uncertainty of 1, i.e. one situation besides s_correct in S. Overall, the model understands the utterances increasingly well. After 2000 input items, the model identifies s_correct in 95% of the cases. With higher levels of propositional uncertainty (not shown here), performance is still relatively robust: given 5 incorrect situations in S, s_correct is identified in 62% of all cases (random guessing gives a score of 17%, or 1/6). Similarly, the proportion of the situation interpreted and the proportion of the utterance analyzed go up over time. This means that the model builds up an increasing repertoire of constructions that allow it to analyze larger parts of the utterance and the situations it identifies. It is important to realize that these mea-

Generation results
Quantitative results
Fig. 6b shows that the average utterance length increases over time. This indicates that the number of constituents of the used constructions grows. Next, Fig. 6c shows the performance of the model on the generation task.
After 2000 input items, the model generates productions expressing 93% of the situation, with an utterance precision of 0.91, and an utterance recall of 0.81.Given a propositional uncertainty of 5, these go down to 79%, 0.76 and 0.59 respectively.
Comparing the utterance precision and recall over time, we can see that the utterance precision is high from the start, whereas the recall gradually increases. This is in line with the observation that children predominantly produce errors of omission (leaving out linguistic material an adult speaker would produce), and few errors of commission (producing linguistic material an adult speaker would not produce).

Qualitative results
Tracking individual productions given specific situations over time allows us to study in detail what the model is doing. Here, we look at one case qualitatively. Given the situation for which the U_act is she put them away, the model generates, over time, the utterances in Table 1. The brackets show the internal hierarchical structure of the derivation. This development illustrates several interesting aspects of the model. First, as discussed earlier, the model mostly makes errors of omission: earlier productions leave out more words found in the adult utterances. Only at t = 550 does the model make an error of commission, using the word in erroneously. Starting from t = 600 (except at t = 950), the model generates the correct utterance, but the derivations leading to this production differ. At t = 550, for instance, the learner combines a completely non-phonologically specific construction for which the constituents refer to the agent, action and goal location, with three 'lexical' constructions that fill in the words for those items. The constructions used after t = 550 are all more specific, combining 3, or even only 2 constructions (t ≥ 1400) where the entire sequence of words "put them away" arises from a single construction.
Using less abstract constructions over time seems contrary to the usage-based idea that constructions become more abstract over the course of acquisition.However, this result follows from the way the probability model is defined.More specific constructions that are able to account for the input will entail fewer combinations, and a derivation with fewer combination operations will often be more likely than one with more such operations.Given equal expressivity of the situation, the former derivation will be selected over the latter in generation.
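A small numeric illustration of this effect, using invented probabilities: a derivation is scored as the product of the probabilities of its constructions, so a derivation with few, individually less probable constructions can still outscore one that combines many more probable ones.

```python
from math import prod

# Hypothetical construction probabilities, chosen only for illustration.
specific = [0.02, 0.05]           # e.g. a subject word plus one multi-word construction
abstract = [0.1, 0.1, 0.1, 0.1]   # e.g. one open frame plus three lexical fillers

p_specific = prod(specific)       # about 0.001
p_abstract = prod(abstract)       # about 0.0001
prefers_specific = p_specific > p_abstract
```

Every construction in the second derivation is more probable than either construction in the first, yet the two extra factors make the whole derivation roughly ten times less likely.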
The effect is indeed in line with another concept hypothesized to play a role in language acquisition on a usage-based account, viz. pre-emption (Goldberg, 2006, 94-95). Pre-emption is the effect that a language user will select a more concrete representation over the combination of more abstract ones. The effect can be reconceptualized in this model as an epiphenomenon of the way the probability model works: simply because combining fewer constructions in a derivation is often more probable than combining more constructions, the former derivation will be selected over the latter. Pre-emption is typically invoked to explain the blocking of overgeneralization patterns, and an interesting future step will be to see if the model can simulate developmental patterns for well-known cases of overgeneralization errors.
The potential for abstraction
The paradigmatization operation allows the model to go beyond observed concrete instances of form-meaning pairings: without it, unseen situations could never be fully expressed. Despite this potential, we have seen that the model relies on highly concrete constructions. The concreteness of the used patterns, however, does not imply the absence of more abstract representations. Fig. 7 gives three examples of constructions in C in one simulation. Construction (a) could be seen as a verb-island construction (Tomasello, 1992, 23-24). The second constituent is phonologically specified with put, and the other arguments are open, but mapped to specific semantic functions. This pattern allows for the expression of many caused-motion events. Construction (b) is the inverse of (a): the arguments are phonologically specified, but the verb slot is open. This would be a case of a pronominal argument frame [you V it], a type of frame that has been found to be helpful in the bootstrapping of verbal meanings (Tomasello, 2001). Finally, (c) presents a case of full abstraction. This construction licenses utterances such as I sit here, you stay there and erroneous ones like he sits on (which, again, will be pre-empted in the generation of utterances if more concrete constructions license he sits on it). Summarizing, abstract constructions are acquired, but only used for those cases in which no concrete construction is available. This is in line with the usage-based hypothesis that abstract constructions do emerge, but that for much of language production, a language user can rely on highly concrete patterns. A next step will be to measure the development of abstractness and length over the constructions themselves, rather than the parses and generations they allow.

Conclusion
This admittedly complex model is an attempt to capture different learning mechanisms in interaction from a usage-based constructionist perspective. Starting with an empty set of linguistic representations, the model acquires words and grammatical constructions simultaneously. The learning mechanisms allow the model to build up increasingly abstract, as well as increasingly long constructions. With these developing representations, we showed how the model gets better over time at understanding the input items, performing relatively robustly under propositional uncertainty.
Moreover, in the generation experiment, the model shows patterns of production (increasingly long utterances) similar to those of children. An important future step will be to look at these productions more closely and investigate if they also converge on more detailed patterns of development in the production of children (e.g. item-specificity, as hypothesized on the usage-based view). Despite highly concrete constructions sufficing for most of production, inspection of the acquired representations tells us that more abstract constructions are learned as well. Here, an interesting next step would be to simulate patterns of overgeneralization in children's production.

Figure 1: The global flow of the model

Figure 2: Combining constructions. The dashed lines represent semantic pointers, either from constituents to the constructional meaning (black) or from the constructions to the situation (red and blue).

Figure 3: The CONCATENATE, IGNORE and BOOTSTRAP operators (internal details of the constructions left out).

Figure 5: The PARADIGMATIZATION mechanism. The construction on top is an abstraction obtained over the two constructions at the bottom.

Figure 6: Quantitative results for the comprehension and generation experiments

Table 1: Generations over time t for one situation.