Multi-Resolution Language Grounding with Weak Supervision

Language is given meaning through its correspondence with a world representation. This correspondence can be at multiple levels of granularity, or resolutions. In this paper, we introduce an approach to multi-resolution language grounding in the extremely challenging domain of professional soccer commentaries. We define and optimize a factored objective function that allows us to leverage discourse structure and the compositional nature of both language and game events. We show that finer-resolution grounding helps coarser-resolution grounding, and vice versa. Our method results in an F1 improvement of more than 48% versus the previous state of the art for fine-resolution grounding.


Introduction
Language is inextricable from its context. A human language user interprets an utterance in the context of, among other things, their perception of the world. Grounded language acquisition algorithms imitate this setup: language is given meaning through its correspondence with a rich world representation. A solution to the acquisition problem must resolve several ambiguities: the segmentation of the text into meaningful units (spans of words that refer to events); determining which events are being referenced; and finding the proper alignment of events to these units.
Historically, language grounding was only possible over simple controlled domains and rigidly structured language. Current research in grounded language acquisition is moving into real-world environments (Yu and Siskind, 2013). Grounding sports commentaries in game events is a specific instance of this problem that has attracted attention (Liang et al., 2009; Snyder and Barzilay, 2007; Hajishirzi et al., 2012), in part because of the complexity of both the language and the world representation involved.
The language employed in soccer commentaries is difficult to ground due to its dense information structure, novel vocabulary and word senses, and colorful, non-traditional syntax. These challenges conspire to foil most language processing techniques, including automated parsers and word-sense disambiguation systems.
In addition to the structural problems presented by the language of soccer commentaries, the problem of reference is further complicated by the fact that for game events (and other real-world phenomena) there is no standardized meaningful linguistic unit. Utterances ranging from a single word to multiple sentences can be used to refer to a single event. For example, in Figure 1 the first four words of commentary (I) refer to a single event, as does the entirety of (II). Turning our attention to Figure 2, sometimes a fragment refers to a combination of events and no further decomposition is available, such as the first fragment of commentary (I). Moreover, it is sometimes desirable to construct a complex of events by determining all the events corresponding to a particular collection of words. For instance, we would want to be able to align the whole of (I) with all the events in the corresponding dashed box. This suggests studying language grounding at multiple levels of granularity (resolutions).
We use resolution to describe the continuum of meaningful units which exist in human language. (Though it is tempting to discretize meaning in text, Chafe (1988) shows that readers imbue text with meaningful intonational patterns drawn from the potentially continuous space of auditory signals.) These resolutions interact in a complicated way, with clues from different resolutions sometimes combining to produce an effect and sometimes negating one another. With enough training data, one could hope to learn the details of the interactions of various resolutions. However, the expense of producing or obtaining supervised training data at multiple resolutions is prohibitive.
To address all these complications, we introduce weakly-supervised multi-resolution language grounding. Our method makes use of a factorized objective function which allows us to model the complex interplay of resolutions. Our language model takes advantage of the discourse structure of the commentaries, making it robust enough to handle the unique language of the soccer domain. Finally, our method relies only on loose temporal co-occurrence of events and utterances as supervision and does not require expensive annotated training data.
To test our method we augment the Professional Soccer Commentary Dataset (Hajishirzi et al., 2012) with fragment-level event alignment annotations. This dataset is composed of commentaries for soccer matches paired with event logs produced by Opta Sportsdata and includes human-annotated gold alignments. We achieve an F1 improvement of over 48% on fragment-level alignment versus a previous state-of-the-art system. We are also able to leverage the interplay of fragment- and utterance-level alignments to improve the previous state-of-the-art utterance-alignment system.

Challenges
Syntactic Limitations: Syntax is used to structure the information provided by an utterance, and so it seems intuitive that syntactic relations could be leveraged in this task. For example, consider utterance (III) in Figure 2. The multi-resolution grounding of (III) would provide a segmentation of the utterance, that is, a division of the utterance into the fragments which refer to separate events. In (III), there is an obvious syntactic correlate to the correct segmentation: each verb phrase within the conjunction headed by "and" identifies a separate event. Parsing (III) to an event-based semantics like that of Davidson (1967), one could associate each verb in an utterance with a game event and achieve the desired segmentation.
Unfortunately, there is a preponderance of examples such as (II) in Figure 2, where four verbs are used to describe a single "miss" event. (II) illustrates just one of the many difficulties of using syntactic information; elsewhere, events are referenced without any explicit verb (such as the use of the phrase "into the books" to refer to a foul event). What is needed instead is a language model that is powerful enough to prescribe some structure yet robust enough to allow the world representation to determine which pieces of language are referring to which referent or set of referents.

Complex Interplay between Resolutions: Language refers at a variety of resolutions, and the relationship between nested reference scopes is complex. A single or few words can indicate entities or properties; full phrases are often needed to denote an action; complex events like a missed shot may take up to several phrases of narration to properly describe. A soccer commentator does not encode every detail necessary for proper alignment and segmentation into their utterances, but rather only enough to make clear to another with similar world knowledge what is meant. A language grounding method is at a severe disadvantage when faced with such implicit information.
Instead, a successful method can make heavy use of the limited lexical, phrasal, and discourse structural cues provided in an utterance, as the different resolutions rely on these different contextual clues to meaning. At finer resolutions one can rely more on the lexical meanings of the words; at medium resolutions, compositionality can be leveraged; at coarser resolutions, discourse features come into play. These cues interact in a complicated way, providing additional challenge.
Consider again Figure 2. In (III), the temporal discourse marker "and" marks the division between the fragments referring to each event. In (I) the same word (used again as a temporal discourse marker) is used to elaborate on the single "foul" event being described in the second fragment. A human (with sufficient understanding of soccer) knows that, despite being separated by the discourse marker, the phrases "bring him down" and "set piece" both refer to the foul. A language grounding algorithm that can model the interaction between such word-level and utterance-level cues can successfully segment both (I) and (III).
Supervision: For language grounding generally, and multi-resolution grounding specifically, supervised training data is expensive to produce. Also, the various grounding domains of interest are highly independent of one another (Liang et al., 2009). In the face of these issues, the ideal correspondence between language and world representation would be learned with as little supervision as possible.

Problem Definition
We define the problem of multi-resolution language grounding as follows: Given a temporal evolution of a world state (a sequence of events) and an overlapping natural language text (a sequence of utterances), we want to learn the best correspondences between the language and the world at different levels of granularity (Figure 2).
To set up notation, for each utterance represented as a set of words $W = \{w_1, w_2, \ldots, w_n\}$, we want a segmentation which expresses the relationship of the words to the events which they describe.
Let $\mathcal{S}$ denote the set of all possible segmentations of $W$, that is, $\mathcal{S} = \{S \mid S \text{ is a segmentation of } W\}$. A segmentation $S$ is in turn a set of non-overlapping fragments ($S = \{s_i\}$), where each fragment is a consecutive sequence of words from the utterance $W$. For example, for utterance (III) from Figure 2, one possible (incorrect) segmentation is $S = \{s_1, s_2, s_3\}$ for $s_1$ = {Chamakh rises highest}, $s_2$ = {and aims a header}, and $s_3$ = {towards goal which is narrowly wide}.
An alignment consists of a segmentation $S$ and a mapping $E$ from fragments of $S$ to the set of all events $\mathcal{E}$. For example, the segmentation $S$ could be mapped as $E = \{\langle s_1, e_2 \rangle, \langle s_2, e_3 \rangle, \langle s_3, e_1 \rangle\}$, with $e_1$ being an Aerial Challenge, $e_2$ being a missed attempt on goal, and $e_3$ being an out of bounds penalty. Let $\mathbf{E} = \mathcal{S} \times \mathcal{E}$ denote the set of all possible alignments.
As we show in Figure 2, events are composed of the various attributes Time, Type, Pass Events, Outcome, and Player. For example, the aerial event in Figure 2 has the attributes and values type:aerial, outcome:successful, pass events:head pass, and player:Chamakh.
Finally, we denote the values for the attributes of each $e_j$ as $e_j^a$, where $a$ ranges over the different attributes of events as represented in the data.
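The notation above can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the `Event` class, the half-open span encoding of fragments, and the `type` values given to the miss and out-of-bounds events are hypothetical; only the aerial event's attribute values and the example segmentation of utterance (III) come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A fragment is encoded as a half-open span [start, end) of word indices.
# This encoding is an assumption made for illustration.
Span = Tuple[int, int]


@dataclass(frozen=True)
class Event:
    """Hypothetical container: an event is a collection of attribute:value pairs."""
    label: str
    attributes: Tuple[Tuple[str, str], ...]

    def value(self, attribute: str) -> str:
        """Return e_j^a, the value of attribute a for this event."""
        return dict(self.attributes)[attribute]


def is_segmentation(n_words: int, fragments: List[Span]) -> bool:
    """Check that fragments are in-bounds, non-empty, non-overlapping spans."""
    ordered = sorted(fragments)
    in_bounds = all(0 <= start < end <= n_words for start, end in ordered)
    disjoint = all(a[1] <= b[0] for a, b in zip(ordered, ordered[1:]))
    return in_bounds and disjoint


def fragment_text(words: List[str], span: Span) -> str:
    return " ".join(words[span[0]:span[1]])


# Utterance (III) and the (incorrect) example segmentation from the text.
words = ("Chamakh rises highest and aims a header "
         "towards goal which is narrowly wide").split()
s1, s2, s3 = (0, 3), (3, 7), (7, 13)
segmentation = [s1, s2, s3]

e1 = Event("aerial challenge", (("type", "aerial"), ("outcome", "successful"),
                                ("pass events", "head pass"), ("player", "Chamakh")))
e2 = Event("missed attempt", (("type", "miss"),))     # other attributes omitted
e3 = Event("out of bounds", (("type", "out"),))       # hypothetical type value

# The example mapping E = {<s1, e2>, <s2, e3>, <s3, e1>}.
alignment: Dict[Span, Event] = {s1: e2, s2: e3, s3: e1}
```

Storing spans rather than word lists keeps fragments hashable, so an alignment can be a plain dictionary from fragments to events.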
We define the multi-resolution grounding of $W$ into $\mathcal{E}$ as the best segmentation $S$ and alignment $E$ that maximize the joint probability distribution:
$$(S^*, E^*) = \arg\max_{S, E} P(S, E \mid W) \qquad (1)$$
This optimization can be accomplished through the use of supervised learning. However, training data is expensive and tedious to produce for the grounding problem, especially at multiple resolutions. Additionally, the complexity of the language in this domain would result in very sparse associations.
Yet if we knew some of the correct fine-resolution alignments, we could use that information to produce good coarse-resolution alignments, and vice versa. Therefore, we formulate a factorized form of the above objective which allows us to learn features specific to aligning at the utterance, fragment, and attribute resolutions. Our method can be optimized with only weak supervision (loose temporal alignments between utterances and a set of events occurring within a window of the utterance time).
We can evaluate such a correspondence in several ways. For each utterance, can we predict the correct events to which this utterance refers? This is the problem of utterance-level alignment.
We can also evaluate based on events: for each event, can we identify the minimal text span(s) which refers to this event? We want a tight correspondence because loose, overlapping alignments are not semantically satisfying. However, we do not want to under associate: human language makes reference at a variety of levels (the word level, the phrase level, the utterance level, and beyond). It is important to correctly identify all and only the words which correspond to a given event. This is the fragment-level alignment problem. We show that good fragment-level alignments will improve utterance-level alignment, and vice versa.
Since events are composed of their attributes, we can imagine a very fine resolution grounding of individual words to individual attributes. In fact, our solution involves producing such a grounding and composing the fragment- and utterance-level alignments therefrom.

Our Method
We have formulated the grounding problem as an optimization of the joint probability distribution $P(S, E \mid W)$, which returns the best segmentation and accompanying event alignments given an utterance $W$. Optimizing this function in the domain of real-world language, however, is a difficult problem. Utterances are long here, and there are many events which could be grounded to each. Furthermore, the cardinality of the set of possible segmentations is combinatorially large. Therefore we decompose Equation 1 using the factor graph depicted in Figure 3. We write the joint probability distribution as a product of the following two potential functions:
$$P(S, E \mid W) = \frac{1}{Z} \prod_{s \in S} \Psi_{align}(E, s)\, \Phi_{seg}(s, W) \qquad (2)$$
where $\Psi_{align}$ is a function for scoring the alignment $E$ for fragment $s$, $\Phi_{seg}$ scores how good a fragment $s$ is for the utterance $W$, and $Z$ is for normalization.
To optimize Equation 2 it is not practical to search the space of possible S, E combinations (this space is combinatorially large). However, we can optimize the factored form using dynamic programming. We first describe how to find values for each of the potentials in sections 4.1 and 4.2. In section 4.3 we describe the dynamic programming approach to optimization.

Event Alignments Given Segmentation
The potential function $\Psi_{align}(E, s)$ takes as inputs a fragment $s$ from segmentation $S$ and a candidate alignment $E$ for $S$ and returns a score for $E$ with regard to $s$. It is here that we produce the multi-resolution alignments; $s$ can vary in size from a single word to a whole utterance. $\Psi_{align}$ decomposes into a prior term, $\Psi_{prior}$, and an affinity term, $\Psi_{affinity}$. The priors ($\Psi_{prior}$) are confidence scores for an alignment $E$ with the whole utterance as given by Hajishirzi et al. (2012), which fits an exemplar SVM to each utterance/event pair. An exemplar SVM is an SVM fit with one positive and many negative instances, allowing us to define an example by what it is not (Shrivastava et al., 2011).
$\Psi_{affinity}$ scores the affinity between a fragment $s$ and the event $e_j$ to which it is aligned. We use the term affinity as a measure of the goodness of an alignment. Intuitively, a fragment $s$ will have a higher affinity for an event $e_j$ if $s$ describes that event well. Formally, the affinity between $s$ and $e_j$ amounts to a product of the affinity between each word $w_i \in s$ and $e_j$:
$$\Psi_{affinity}(s, e_j) = \prod_{w_i \in s} \psi_{atr.}(w_i, e_j)$$
where $e_j$ is the event to which $s$ is aligned in alignment $E$ and $\psi_{atr.}(w_i, e_j)$ is the affinity between $w_i$ and event $e_j$. Since $e_j$ is defined by a collection of attributes, we compose $\psi_{atr.}(w_i, e_j)$ from the affinities $\psi(w_i, e_j^a)$ between $w_i$ and each attribute $a$ of $e_j$.
In order to determine the affinity of a word and an event attribute, we create attribute:value classifiers -one for each attribute:value pair that occurs in any event. For example, for goals we create a type:goal classifier, and for unsuccessful events we create an outcome:unsuccessful classifier.
For the categorical attributes Type, Outcome, and Pass Events, we fit a linear SVM (Fan et al., 2008) using the utterance-level alignments provided by Ψ prior (the exemplar SVMs) to determine the positive and negative examples. For instance, we use all the utterances which are aligned with an event whose type value is "pass" as positive examples for our type:pass classifier, and all other utterances as negative examples.
The weight assigned to each dimension in a linear SVM describes the relative importance of that dimension in the classification process. The dimensions of our attribute:value SVMs are the words of the corpus, normalized for case and minus punctuation and stop words. Therefore, the affinity of a word $w_i$ and the attribute:value $e_j^a$ is the weight of the dimension corresponding to $w_i$ in the $e_j^a$ attribute:value classifier. Following others (Liang et al., 2009; Kate and Mooney, 2007), we use string matches to determine the affinity between a word and the Player attribute.
In order to make comparisons between the importance of a word in the decision process for different classifiers, we normalize the weight vectors for each. These attribute:value classifiers produce our finest resolution alignments, allowing us to define a correspondence between a single word and a single attribute of any event.
By considering $e_j$ in terms of its attributes, we are able to compose a score for $e_j$ with fragment $s$. This is a kind of double-sided compositional semantics, where both the meaningful signs ($s$) and their extensions ($e_j$) are composed of finer-resolution atomic parts ($w_i$ and $e_j^a$, respectively).
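The composition described in this section can be sketched as follows. This is a toy illustration, not the authors' code. Several things here are assumptions: the weight values are invented, L2 normalization stands in for the unspecified normalization of the weight vectors, the word-to-event score takes the best-scoring attribute (the text does not say how attribute scores are combined), and a small floor keeps a single unseen word from zeroing the product.

```python
import math
from typing import Dict, List, Tuple

AttrValue = Tuple[str, str]                      # e.g. ("type", "miss")
Classifiers = Dict[AttrValue, Dict[str, float]]  # per-classifier word weights


def l2_normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale one classifier's weight vector to unit length (assumed scheme)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return dict(weights)
    return {word: w / norm for word, w in weights.items()}


def word_attribute_affinity(word: str, attr_value: AttrValue,
                            classifiers: Classifiers) -> float:
    """psi(w_i, e_j^a): weight of the word's dimension in that classifier."""
    return classifiers.get(attr_value, {}).get(word.lower(), 0.0)


def word_event_affinity(word: str, event: List[AttrValue],
                        classifiers: Classifiers) -> float:
    """psi_atr(w_i, e_j). ASSUMPTION: use the best-scoring attribute."""
    return max((word_attribute_affinity(word, av, classifiers) for av in event),
               default=0.0)


def fragment_affinity(fragment: List[str], event: List[AttrValue],
                      classifiers: Classifiers, floor: float = 1e-3) -> float:
    """Psi_affinity(s, e_j): product of word-to-event affinities.

    ASSUMPTION: scores are floored at a small positive constant.
    """
    score = 1.0
    for word in fragment:
        score *= max(word_event_affinity(word, event, classifiers), floor)
    return score


# Invented weights standing in for trained linear SVM coefficients.
raw_weights: Classifiers = {
    ("type", "aerial"): {"rises": 0.9, "highest": 0.7},
    ("type", "miss"): {"header": 0.8, "wide": 0.6},
    ("pass events", "head pass"): {"header": 0.3, "towards": 0.5, "goal": 0.4},
}
classifiers = {av: l2_normalize(w) for av, w in raw_weights.items()}

aerial_event = [("type", "aerial"), ("pass events", "head pass")]
miss_event = [("type", "miss"), ("pass events", "head pass")]
fragment = ["rises", "highest"]
```

With these toy weights, `fragment_affinity(fragment, aerial_event, classifiers)` is much larger than the same fragment scored against `miss_event`, which is the behavior the affinity potential is meant to capture.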

Segmentations Given Utterances
The potential function $\Phi_{seg}(s, W)$ from Equation 2 returns a score for a fragment within an utterance. A segmentation can be thought of as the collection of bigrams $\langle w_i, w_{i+1} \rangle$ where $w_i$ is the last word of a fragment which is being used to describe one event and $w_{i+1}$ is the first word of a fragment being used to describe a different event.
We will refer to such bigrams as splitpoints.
The function $\Phi_{seg}$ should favor fragments that begin and end at good splitpoints and whose intermediate bigrams are bad splitpoints. We formalize this in terms of a bigram score $\phi$: fragment $s$ is a span of $m$ consecutive words $\{w_k, \ldots, w_{k+m}\}$ from $W$, and $\phi(w_i, w_{i+1})$ is a score for how good a splitpoint $\langle w_i, w_{i+1} \rangle$ would make (explained below). Ideally, $\phi$ will be a classifier which can tell us if a given bigram is a good splitpoint for the utterance $W$. However, ours being an attempt at weakly-supervised learning, we have no labeled examples of correct splitpoints from which to work. Instead, we employ linguistic knowledge to create proxy labels. We will use these proxy labels to train a classifier to discover the features of good splitpoints which can be generalized and produce a more robust system.
The proxy labeling scheme we developed is based on conservative components common to a variety of theories of discourse. Discourse theories aim to model the relationships which exist between adjacent utterances in a coherent discourse. Since we consider a sports commentary to be a coherent discourse, we can leverage results from discourse theory in producing our proxy labels.
Temporal Discourse: Events in a soccer match occur in a temporal sequence, and so it is reasonable to assume that the language used to describe them will employ temporal discourse relations to distinguish fragments describing separate events. Pitler et al. (2008) have constructed a list of discourse relations which can be easily automatically identified, including temporal discourse relations. These are indicated by the presence of discourse markers -alternately known as cue phrases. We hypothesize that cue phrases can be used to identify splitpoints and use them in our proxy labeling scheme. This method is not restricted to temporally related discourse: some contingency, expansion, and comparison relations are also analyzed as "easily identifiable". As such, our segmentation process can also be used to ground language into a world state where these relations would hold.
Prosodic Discourse: We also make use of prosodic discourse cues. Pierrehumbert and Hirschberg (1990) claim that intonational phrases play an important role in discourse segmentation. Therefore, we hypothesize that the edges of intonational phrases are very likely to correspond with correct splitpoints. Viewing the commentary transcriptions as a noisy channel of the actual speech signal, we can identify the intonational phrase boundaries with the punctuation inserted in the transcription process. Chafe (1988) confirms that punctuation in written language has a strong correspondence with intonational phrase boundaries, and an assumption like ours has been successfully implemented in speech synthesis systems (Black and Lenzo, 2000). Thus, we include bigrams containing punctuation as splitpoints in our proxy labels.
Table 1: Feature description for the splitpoint classifier.
  Is $w_i$/$w_{i+1}$ a discourse marker?
  Is $w_i$/$w_{i+1}$ punctuation?
  Is $w_i$/$w_{i+1}$ a player name?
  Part of speech of $w_i$/$w_{i+1}$
  Is one of $w_i$/$w_{i+1}$ a dependent of the other?
  Are $w_i$ and $w_{i+1}$ dependents of the same governor?
  Dependency relations that hold across splitpoint
  Height of $w_i$/$w_{i+1}$ in the dependency tree
  Difference in height of $w_i$/$w_{i+1}$ in dependency tree
  $\psi(w_i, e_j)$ of all words left versus right of splitpoint
  Symmetric difference of best affinity scores for $w_i$/$w_{i+1}$
  Are best affinity scores from the same event?
Figure 4: We use a trellis to allow for dynamic programming optimization of the objective function.
Splitpoint Classifier: All other bigrams besides those above are labeled as negative examples, and a linear SVM is fit to the data. The features for the classifier include structural, discourse, and statistical features. We make use of dependency parse information from the Stanford dependency parser (De Marneffe and Manning, 2008). The full features list is explained in Table 1.
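The proxy labeling scheme can be sketched in a few lines. This is a minimal illustration under assumptions: the cue-phrase set below is a tiny hypothetical stand-in for the list of Pitler et al. (2008), and treating a bigram as positive when either of its tokens is a marker or punctuation is one reading of the description above.

```python
import string
from typing import List, Set, Tuple

# ASSUMPTION: a tiny illustrative cue-phrase set, not the actual list used.
DISCOURSE_MARKERS: Set[str] = {"and", "then", "but", "after", "before", "when"}


def is_punctuation(token: str) -> bool:
    """True if the token consists only of ASCII punctuation characters."""
    return token != "" and all(ch in string.punctuation for ch in token)


def proxy_labels(tokens: List[str],
                 markers: Set[str] = DISCOURSE_MARKERS
                 ) -> List[Tuple[Tuple[str, str], int]]:
    """Label each adjacent bigram 1 (proxy splitpoint) or 0 (negative example).

    A bigram is positive if either token is punctuation or a discourse
    marker; every other bigram is a negative example.
    """
    labeled = []
    for left, right in zip(tokens, tokens[1:]):
        positive = any(is_punctuation(t) or t.lower() in markers
                       for t in (left, right))
        labeled.append(((left, right), 1 if positive else 0))
    return labeled


example = ["Chamakh", "rises", "highest", "and", "aims", "a", "header"]
example_labels = [label for _, label in proxy_labels(example)]
```

The resulting labeled bigrams would then be featurized (Table 1) and used to fit the linear SVM; that step is omitted here.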

Optimization
We want to maximize the function in Equation 1, and we have explained that we can approximate this by maximizing the factored form in Equation 2. By the above methods, we can produce values for the functions $\Psi_{align}$ and $\Phi_{seg}$. What remains is to optimize Equation 2. We take advantage of the factorization by using a dynamic programming approach to optimization. Figure 4 illustrates the setup. For each word $w_i$ of the utterance, we create a column of nodes in our trellis, with one row for each event $e_j \in \mathcal{E}$. The nodes represent the affinity of a given word $w_i$ with event $e_j$. The weights on these nodes come from $\psi_{atr.}(w_i, e_j)$ described in section 4.1.
The nodes in column $w_i$ are connected to the nodes in column $w_{i+1}$ by edges whose weights are drawn from the splitpoint classifier response $\phi(w_i, w_{i+1})$. We label the edges between adjacent nodes corresponding to different events with the responses from the splitpoint classifier, and the inverse of these responses for edges connecting nodes corresponding to the same event.
Table 3: Fragment-level alignments starting from raw data.
We then use the Viterbi algorithm (Viterbi, 1967) to find the maximum scoring path through this trellis. The maximum scoring path optimizes Equation 2, and serves as our approximation of the optimization of Equation 1. We choose the top k diverse paths through the trellis and use the associations therein as our alignments. See Figure 5 for a detailed example of how our Viterbi path coincides with the responses from the attribute:value classifiers.
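The trellis search can be sketched as a standard Viterbi pass. This is a simplified illustration rather than the system described here. It assumes additive scores, splitpoint responses scaled to [0, 1], and "inverse" read as `1 - phi`; it returns only the single best path, not the top k diverse paths.

```python
from typing import List, Tuple


def viterbi_alignment(node_scores: List[List[float]],
                      split_scores: List[float]) -> Tuple[float, List[int]]:
    """Find the best event index for each word.

    node_scores[i][j]: affinity of word i with event j.
    split_scores[i]:   splitpoint response phi(w_i, w_{i+1}), assumed in [0, 1].
    Edges between different events are weighted phi; edges that stay on the
    same event are weighted 1 - phi (ASSUMED reading of "inverse").
    """
    n_words = len(node_scores)
    n_events = len(node_scores[0])
    if len(split_scores) != n_words - 1:
        raise ValueError("need one splitpoint score per adjacent word pair")

    best = list(node_scores[0])          # best score of a path ending at (0, j)
    backpointers: List[List[int]] = []
    for i in range(1, n_words):
        phi = split_scores[i - 1]
        new_best, pointers = [], []
        for j in range(n_events):
            candidates = [best[k] + (phi if k != j else 1.0 - phi)
                          for k in range(n_events)]
            k_star = max(range(n_events), key=candidates.__getitem__)
            new_best.append(candidates[k_star] + node_scores[i][j])
            pointers.append(k_star)
        best = new_best
        backpointers.append(pointers)

    # Trace the best path back from the highest-scoring final node.
    j = max(range(n_events), key=best.__getitem__)
    path = [j]
    for pointers in reversed(backpointers):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return max(best), path


# Toy trellis: 4 words, 2 events, a strong splitpoint between words 1 and 2.
toy_nodes = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.2, 0.9]]
toy_splits = [0.1, 0.9, 0.1]
toy_score, toy_path = viterbi_alignment(toy_nodes, toy_splits)
```

In the toy trellis the path stays on the first event for two words and switches to the second event at the high-scoring splitpoint, so the path itself encodes both the segmentation and the alignment.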

Experiments
One justification for multi-resolution language grounding would be if finer-resolution grounding improves coarser-resolution grounding and vice versa. If so, we expect that better utterance-level alignments will improve fragment-level alignments, and that in turn those fragment-level alignments will improve utterance-level alignments. We evaluate both of these hypotheses.

Experimental Setup
Dataset: We use the publicly available Professional Soccer Commentary (PSC) dataset introduced in Hajishirzi et al. (2012). This dataset is composed of professional commentaries from the 2010-2011 season of the English Premier League, along with a human-annotated data feed produced for each game by Opta Sportsdata (Opta, 2012) which describes all events occurring around the ball. Events include passes, shots, misses, cards, tackles, and other relevant game details. Each event category is defined precisely and the feed is annotated by professionals according to strict event description guidelines.
Method                     Precision  Recall  F1
Liang et al. (2009)        0.327      0.418   0.367
Hajishirzi et al. (2012)   0.355      0.576   0.439
Our approach               0.407      0.520   0.457
The PSC also provides ground truth alignment of full utterances to events in the data feed, and for this work we have augmented it with ground truth fragment-level annotations.
We use data from 7 games of the PSC. These games consist of 778 utterances totaling 13,692 words. There are 12,275 events. This data is labeled with ground truth utterance- and fragment-level alignments.
Metric: There are 1,295 correct utterance-to-event alignments. For evaluation we use precision, recall, and F1 of our utterance-level alignments. The evaluation of fragment-level alignments is less straightforward. This is due to the two features of a correct fragment alignment: picking the correct fragment boundaries and associating the fragment with the correct event. We evaluate fragment-level alignment on a per-word basis. We consider precision in this task to be the number of correct word-to-event alignments versus the total number of alignments produced by a system. Recall is the number of correct word-to-event alignments versus the total gold word-to-event alignments, of which there are 18,147.
Comparisons: We compare to two previous works: Liang et al. (2009), which produces both segmentation and alignment results; and Hajishirzi et al. (2012), which produces state-of-the-art alignments. When evaluating segmentation, we compare how well the systems perform starting from the raw dataset, and starting from gold utterance-level alignments. This allows us to isolate the segmentation process from the overall system architectures. It also gives us some insight into the effect of event priors on the segmentation and alignment processes.
Table 6: Ablation studies for utterance-level alignments, removing $\Psi_{affinity}$ and $\Phi_{seg}$ from our model by replacing them with a uniform function.
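The per-word metric can be written down directly. In this sketch, which is an illustration and not the evaluation script used for the reported numbers, each alignment is represented as a hypothetical `(utterance_id, word_index, event_id)` triple, so that a word aligned to two events contributes two alignments.

```python
from typing import Hashable, Set, Tuple

WordAlignment = Tuple[Hashable, int, Hashable]  # (utterance id, word index, event id)


def word_level_prf(predicted: Set[WordAlignment],
                   gold: Set[WordAlignment]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over word-to-event alignments."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: two of four predicted alignments are correct; gold has three.
predicted = {(0, 0, "e1"), (0, 1, "e1"), (0, 2, "e2"), (0, 3, "e2")}
gold = {(0, 0, "e1"), (0, 1, "e1"), (0, 2, "e1")}
p, r, f1 = word_level_prf(predicted, gold)
```

Here precision is 2/4, recall is 2/3, and F1 is their harmonic mean, 4/7.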

Results
We evaluate our method on its alignments at the fragment-level and at the utterance-level. The results are as follows:
Fragment-level: Our results for segmentation can be seen in Tables 2 and 3. Table 2 shows the results achieved on the fragment-level alignment task using human-labeled utterance-to-event alignments. In this setting, all and only the correct events for each utterance are present. Still, there are several ambiguities in the data. Some fragments are aligned in the gold data with multiple events, and some are aligned to no event. Our method outperforms the previous by a large margin in terms of both precision and recall. We show below how this is due to our system's accommodation of discourse structure when making segmentation decisions and the factored form of our optimization.
Table 3 shows the results for fragment-level alignment by applying each system starting from the raw data. Here, in addition to the ambiguities mentioned above, the problem is further complicated by the fact that some correct events are missing from the alignments produced by each system and some incorrect events are included in these alignments (see Error Analysis below for details). Still, our method achieves a significant improvement, with a 48% increase in F1 versus prior work.
Table 5 shows ablation results for the effect of the factors used in our optimization for fragment-level alignments. These results demonstrate the value of each factor in the fragment-level alignment process. We cannot ascribe the benefit of this method to one factor or another alone; it is their concert that improves performance.

Utterance-level: We have posited that good finer-resolution alignments will improve the coarser-resolution utterance-to-event alignments. Our results confirm this hypothesis. Table 4 shows our results on these alignments. We are able to improve F1 versus a state-of-the-art system which is tuned to maximize its F1 score. The majority of our improvement comes from the increased precision of our system, due to the influence of the finer-resolution fragment-level alignments on these coarser, utterance-level alignments. We provide a detailed example of this below. Ablation results are shown in Table 6.

Qualitative Analysis
A qualitative analysis of our system reveals the power of our factored objective, double-sided compositional approach, and leveraging of discourse structure. Figure 5 shows the best path through the trellis of the example sentence used in the introduction. For explanatory purposes, we have split every event into its three component attributes. This allows us to see how the attribute:value classifiers combine to produce an alignment.
Discourse Structure: The fragment-level alignment we have produced for this utterance is perfect: it correctly identifies the single splitpoint and correctly identifies each fragment with the associated event.
The identification of the splitpoint "and" comes from the fact that this word has, among other uses, a discourse connective meaning. Thus, the edges in our trellis between different events are weighted higher than edges between the same event in the edges between the nodes for "highest" and "and", encouraging the Viterbi path to change events at this point.

Compositionality: We can see the effect of the compositional approach we have taken (composing $\Psi_{affinity}(s, e_j)$ from the attribute:value classifier scores $\psi(w_i, e_j^a)$) by looking at how the best path makes use of different attributes of the same event. For the "miss" event aligned with the second part of the sentence, we can see that the best path makes use of values from both the type:miss and pass events:head pass classifiers.
Affinities: A few interesting associations are worth pointing out. First, we note that the word "header" has a stronger affinity for the type:miss attribute than it does for the pass events:head pass attribute. At first blush, this seems like a mistake in our classifier. However, we can see that even in this single trellis all three events have the pass events:head pass attribute. The utterance-level alignment uses this association already, aligning utterances containing the word "header" with events that have a pass events:head pass attribute. At a finer resolution, it is necessary to make a different distinction between events. Our method finds that the presence of the word "header" is a stronger indicator of an event with a type:miss attribute, and thus this association is made.
Words that are better for the coarser-resolution association with the pass events:head pass attribute are "towards" and "goal". Out of the 10 utterances containing the word "towards" in the dataset, 3 of these are aligned with at least 1 pass events:head pass event, making this strong association a correct one. The word "goal" also has an affinity for the pass events:head pass attribute due to the fact that many events with this attribute are attempts on goal. This correlates with domain knowledge about soccer, because, although there may be other uses of their head by a player in the game, shots on goal are events which will nearly always be commented upon by an announcer.
Factorization: We have shown that finer-resolution fragment-level alignments can improve utterance-level alignments. From the exemplar SVMs, we are given an utterance-level alignment of the three events shown in the trellis with the utterance. This alignment is incorrect: the gold utterance alignment only includes the bottom two events. But by building an utterance-level alignment from the results of our fragment-level alignment, we are left with only the two correct events. We prune the topmost event due to its failure to participate in a finer-resolution alignment.
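The pruning step can be illustrated with a short sketch. The event identifiers and fragment spans below are hypothetical stand-ins for the three trellis events; the function simply keeps the candidate events that at least one fragment was aligned to, which is one plausible reading of the procedure and not a confirmed detail of the system.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open word span [start, end), assumed encoding


def prune_utterance_alignment(candidate_events: List[str],
                              fragment_alignment: Dict[Span, str]) -> List[str]:
    """Keep only candidate events that participate in a fragment alignment."""
    used = set(fragment_alignment.values())
    return [event for event in candidate_events if event in used]


# Hypothetical identifiers: three candidates from the utterance-level prior,
# of which only two are selected by the fragment-level path.
candidates = ["aerial_top", "aerial_bottom", "miss"]
fragments = {(0, 3): "aerial_bottom", (3, 13): "miss"}
pruned = prune_utterance_alignment(candidates, fragments)
```

Because this step can only remove candidates, it can raise precision but never recall.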

Error Analysis
The majority of the errors made on our fragmentlevel alignments come in one of two flavors: Firstly, we sometimes erroneously identify a fragment as referring to an event when in truth it refers to no event. Commentators often describe facts about players or the weather or previous games which have no extension in the current game. However, our system cannot distinguish such language from the language referring to this game. This is a good avenue for future exploration.
The second set of errors we make in fragmentation is caused by bad event priors. Our current setup cannot increase recall: we can only improve the precision of the utterance-level alignments we are given. Therefore, if an event is overlooked in the first pass of utterance-level alignments, we cannot reintroduce it through a fragment alignment. This is a direction for future work as well.
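The effect of this limitation on precision and recall can be checked with a small worked example. The sketch below uses invented event identifiers and an invented gold alignment purely for illustration; it is not drawn from our data.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of a predicted event set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold alignment for one utterance.
gold = {"e2", "e3", "e4"}

# Hypothetical first-pass utterance-level alignment: it includes a spurious
# event (e1) and overlooks a gold event (e4).
first_pass = {"e1", "e2", "e3"}

# Fragment-level alignment can only prune events from the first pass,
# so the best it can do here is remove e1; e4 cannot be reintroduced.
pruned = first_pass - {"e1"}

p_before, r_before = precision_recall(first_pass, gold)
p_after, r_after = precision_recall(pruned, gold)
```

In this example pruning raises precision while recall stays where the first pass left it, since the overlooked event is never a candidate for fragment alignment.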

Related Work
Early semantic parsing work made use of fully supervised training (Zettlemoyer and Collins, 2005; Ge and Mooney, 2006; Snyder and Barzilay, 2007), but more recent work has focused on reducing the amount of supervision required (Artzi and Zettlemoyer, 2013). A few unsupervised approaches exist (Poon and Domingos, 2009; Poon, 2013), but these are specific to translating language into queries over highly structured databases and cannot be applied to our more flexible domain.
There are few datasets as detailed as the Professional Soccer Commentary Dataset. Early work in understanding soccer commentaries focused on RoboCup soccer (Chen and Mooney, 2008; Chen et al., 2010; Bordes et al., 2010; Hajishirzi et al., 2011), where simple language describes each event, and events are in a one-to-one correspondence with utterances. Another dataset used for language grounding is the Weather Report Dataset (Liang et al., 2009). Here again, however, we have mostly single utterances paired with single events, and many alignments are made via numerical string matching rather than by learning lexical cues. The NFL Recap dataset (Snyder and Barzilay, 2007) is also laden with numerical fact matching, and does not include the fragment-level segmentation annotation that the PSC dataset provides.
Impressive advances have been made in grounding language in instructions. Branavan et al. (2009) and Vogel and Jurafsky (2010) work in the domain of computer technical support instructions, mapping language to actions using reinforcement learning. Matuszek et al. (2012b) parse simple language to robot control instructions. Our work focuses on dealing with a richer space, both in terms of the language used and the world representation into which it is grounded, and on leveraging the multiple resolutions of reference.
An exciting direction of research, closer to our own, aims to ground natural language in visual perception systems. Matuszek et al. (2012a) attempt to learn a joint model of language and object characteristics of a workplace environment. Yu and Siskind (2013) ground moderately rich language in automatically annotated video clips. Again, the contribution of our work relative to the above lies in the complexity of the language with which we deal and in our multi-resolution model.

Conclusion
The problem of grounding complex natural human language such as soccer commentaries is extremely difficult at all resolutions, and it is most challenging at finer resolutions where data is sparsest and small errors cannot be as easily normalized. Our work will help open new avenues of research into this difficult and exciting problem. This paper presents a new method for the multi-resolution grounding of complex natural language in a detailed world representation. Our factor graph allows us to decompose the grounding problem into the more tractable subproblems of segmenting the language into fragments and aligning the fragments with the world representation. In the segmentation phase, we make use of linguistic theories of discourse to create a proxy of labels from which we learn statistical and structural features of good split points. In the alignment phase, we bootstrap the learning of finer-grained correspondences between the language and the world representation with rough alignments from a state-of-the-art system. We combine these phases in a dynamic programming setup which allows us to efficiently optimize our objective.
We have shown that factoring the acquisition problem into separate alignment and segmentation phases improves performance on several evaluation metrics. We achieve considerable improvements over the previous state of the art on finer-resolution alignments in the domain of professional soccer commentaries, and we show that we can leverage groundings at one resolution to improve alignments in another.
Several extensions of this work are possible. We would like to annotate more games to improve our dataset. We could also improve our model by encoding the dynamics of the environment. We did not attempt to learn this information in our process, but it is likely that modeling the event transition probabilities would provide better results. A larger future project would extend the method outlined herein to produce templates for automated commentary generation.