Semantic Parsing Using Content and Context: A Case Study from Requirements Elicitation

We present a model for the automatic semantic analysis of requirements elicitation documents. Our target semantic representation employs live sequence charts, a multi-modal visual language for scenario-based programming, which can be directly translated into executable code. The architecture we propose integrates sentence-level and discourse-level processing in a generative probabilistic framework for the analysis and disambiguation of individual sentences in context. We show empirically that the discourse-based model consistently outperforms the sentence-based model when constructing a system that reflects all the static (entities, properties) and dynamic (behavioral scenarios) requirements in the document.


Introduction
Requirements elicitation is a process whereby a system analyst gathers information from a stakeholder about a desired system (software or hardware) to be implemented. The knowledge collected by the analyst may be static, referring to the conceptual model (the entities, their properties, the possible values) or dynamic, referring to the behavior that the system should follow (who does what to whom, when, how, etc). A stakeholder interested in the system typically has a specific static and dynamic domain in mind, but he or she cannot necessarily prescribe any formal models or code artifacts. The term requirements elicitation we use here refers to a piece of discourse in natural language, by means of which a stakeholder communicates their desiderata to the system analyst.
The role of a system analyst is to understand the different requirements and transform them into formal constructs, formal diagrams or executable code. Moreover, the analyst needs to consolidate the different pieces of information to uncover a single shared domain. Studies in software engineering aim to develop intuitive symbolic systems with which human agents can encode requirements that would then be unambiguously translated into a formal model (Fuchs and Schwitter, 1995; Bryant and Lee, 2002).
More recently, a natural fragment of English was defined that can be used for specifying requirements, and that can be effectively translated into live sequence charts (LSC) (Damm and Harel, 2001; Harel and Marelly, 2003), a formal language for specifying the dynamic behavior of reactive systems. However, the grammar that underlies this language fragment is highly ambiguous, and all disambiguation has to be conducted manually by a human agent. Indeed, it is commonly accepted that the more natural a controlled language fragment is, the harder it is to develop an unambiguous translation mechanism (Kuhn, 2014).
In this paper we accept the ambiguity of requirements descriptions as a premise, and aim to answer the following question: can we automatically recover a formal representation of the complete system, one that best reflects the human-perceived interpretation of the entire document? Recent advances in natural language processing, with an eye to semantic parsing (Zettlemoyer and Collins, 2005; Liang et al., 2011; Artzi and Zettlemoyer, 2013; Liang and Potts, 2014), use different formalisms and various kinds of learning signals for statistical semantic parsing. In particular, the model of Lei et al. (2013) induces input parsers from format descriptions. However, rarely do these models take into account the entire document's context.
The key idea we promote here is that discourse context provides substantial disambiguating information for sentence analysis. We suggest a novel model for integrated sentence-level and discourse-level processing, in a joint generative probabilistic framework. The input for the requirements elicitation task is given in the simplified, yet highly ambiguous, fragment of English discussed above. The output, in contrast, is a sequence of unambiguous and well-formed live sequence charts (LSC) (Damm and Harel, 2001; Harel and Marelly, 2003) describing the dynamic behavior of the system, tied to a single shared code-base called a system model (SM).
Figure 1: An LSC scenario: "When the user clicks the button, the display color must change to red."
Our solution takes the form of a hidden Markov model (HMM) in which emission probabilities reflect the grammaticality and interpretability of textual requirements via a probabilistic grammar, and transition probabilities model the overlap between SM snapshots of a single, shared domain. Using efficient Viterbi decoding, we search for the sequence of domain snapshots that is most likely to have generated the entire requirements document. We empirically show that such an integrated model consistently outperforms a sentence-based model learned from the same set of data.
The remainder of this document is organized as follows. In Section 2 we describe the task, and spell out our formal assumptions concerning the input and the output. In Section 3 we present our target semantic representation and a specially tailored notion of grounding for anchoring the requirements in a code-base. In Section 4 we develop our sentence-based and discourse-based models, and in Section 5 we evaluate the models on various case studies. In Section 6 we discuss applications and future extensions, and in Section 7 we summarize and conclude.

Parsing Requirements Elicitation Documents: Task Description
There is an inherent discrepancy between the input and the output of the software engineering process. The input, system requirements, is specified in a natural, informal, language. The output, the system, is ultimately implemented in a formal unambiguous programming language. Can we automatically recover such a formal representation of a complete system from a set of requirements? In this work we explore this challenge empirically.
The Input. We assume a scenario-based programming paradigm (a.k.a. behavioral programming (BP) (Harel et al., 2012)) in which system development is seen as a process whereby humans describe the expected behavior of the system by means of "short-stories", formally called scenarios (Harel, 2001). We further assume that a given requirements document describes exactly one system, and that each sentence describes a single, possibly complex, scenario. The requirements we aim to parse are given in a simplified form of English (specifically, the English fragment discussed in the Introduction). Contrary to strictly formal specification languages, which are closed and unambiguous, this fragment of English employs an open-ended lexicon and exhibits extensive syntactic and semantic ambiguity.
The Output. Our target semantic representation employs live sequence charts (LSC), a diagrammatic formal language for scenario-based programming (Damm and Harel, 2001). Formally, LSCs are an extension of the well-known UML message sequence diagrams (Harel and Maoz, 2006), and they have a direct translation into executable code (Harel and Marelly, 2003). Using LSC diagrams for software modelling enjoys the advantages of being easily learnable, intuitively interpretable (Eitan et al., 2011) and straightforwardly amenable to execution (Harel et al., 2002) and verification (Harel et al., 2013). The LSC language is particularly suited for representing natural language requirements, since its basic formal constructs, scenarios, nicely align with events, the primitive objects of Neo-Davidsonian Semantics (Parsons, 1990).
Live Sequence Charts and Code Artifacts. A live sequence chart (LSC) is a diagram that describes a possible or necessary run of a specified system. In a single LSC diagram, entities are represented as vertical lines called lifelines, and interactions between entities are represented using horizontal arrows between lifelines called messages, connecting a sender to a receiver. Messages may refer to other entities (or properties of entities) as arguments. Time in LSCs proceeds from top to bottom, imposing a partial order on the execution of messages. LSC messages can be hot (red, "must happen") or cold (blue, "may happen"). A message may have an execution status, which designates it as monitored (dashed arrow, "wait for") or executed (full arrow, "execute"). The LSC language also encompasses conditions and control structures, and it allows defining requirements in terms of the negation of charts. Figure 1 illustrates the LSC for the scenario "When the user clicks the button, the display color must change to red." The respective system model (SM) is a code-base hierarchy containing the classes USER, BUTTON, DISPLAY, the method BUTTON.CLICK() and the property DISPLAY.COLOR.

Formal Settings
In the text-to-code generation task, we aim to implement a prediction function $f : \mathcal{D} \to \mathcal{M}$, such that $D \in \mathcal{D}$ is a piece of discourse consisting of an ordered set of requirements $D = d_1, d_2, \ldots, d_n$, and $f(D) = M \in \mathcal{M}$ is a code-base hierarchy that grounds the semantic interpretation of $D$; we denote this by $M \rhd sem(d_1, \ldots, d_n)$. We now define the objects $D$, $M$, and describe how to construct the semantic interpretation function ($sem(\cdot)$). We then spell out the notion of grounding ($\rhd$).
Surface Structures: Let $\Sigma$ be a finite lexicon and let $L_{req} \subseteq \Sigma^*$ be a language for specifying requirements. We assume the sentences in $L_{req}$ have been generated by a context-free grammar $G = \langle N, \Sigma, S \in N, R \rangle$, where $N$ is a set of nonterminals, $\Sigma$ is the aforementioned lexicon, $S \in N$ is the start symbol and $R$ is a set of context-free rules $\{A \to \alpha \mid A \in N, \alpha \in (N \cup \Sigma)^*\}$. For each utterance $u \in L_{req}$, we can find a sequential application of rules that generates it: $u = r_1 \circ \ldots \circ r_k$; $\forall i : r_i \in R$. We call such a sequence a derivation of $u$. These derivations may be graphically depicted as parse trees, where the utterance $u$ defines the sequence of tree terminals in the leaves.
We define $T_{req}$ to be the set of trees strongly generated by $G$, and utilize an auxiliary yield function $yield : T_{req} \to L_{req}$ returning the leaves of the given tree $t \in T_{req}$. Different parse trees can generate the same utterance, so the task of analyzing the structure of an utterance $u \in L_{req}$ is modeled via a function $syn : L_{req} \to T_{req}$ that returns the correct, human-perceived, parse of $u$.
Semantic Structures: Our target semantic representation of a requirement $d \in L_{req}$ is a diagrammatic structure called a live sequence chart (LSC). The LSC formal definition we provide here is based on the appendix of Harel and Marelly (2003), but rephrased in set-theoretic, event-based, terms. We defined this alternative formalization in order to make LSCs compatible with Neo-Davidsonian, event-based, semantic theories. As a result, this form of LSC formalization is well-suited for representing the semantics of natural language utterances.
Let us assume that $L$ is a dictionary of entities (lifelines), $A$ is a dictionary of actions, $P$ is a dictionary of attribute names and $V$ a dictionary of attribute values. The set of simple events in the LSC formal system is defined as follows:
$$E_s \subset L \times A \times L \times (L \times P \times V)^* \times \{hot, cold\} \times \{executed, monitored\}$$
where $e = \langle l_1, a, l_2, \{l_i : p_i : v_i\}_{i=3}^{k}, temp, exe \rangle$ and $l_i \in L$, $a \in A$, $p_i \in P$, $v_i \in V$, $temp \in \{hot, cold\}$, $exe \in \{executed, monitored\}$. The event $e$ is called a message in which an action $a$ is carried over from a sender $l_1$ to a receiver $l_2$. The set $\{l_i : p_i : v_i\}_{i=3}^{k}$ depicts a set of attribute:value pairs provided as arguments to action $a$. The temperature $temp$ marks the modality of the action (may, must), and the status $exe$ distinguishes actions to be taken from actions to be waited for.
An event $e$ can also refer to a state, where a logical expression is being evaluated over a set of property:value pairs. We call such an event a condition, and specify the set of possible conditions as follows:
$$E_{cond} \subset Exp \times (L \times P \times V)^* \times \{hot, cold\} \times \{executed, monitored\}$$
Specifically, $e = \langle exp, \{l_i : p_i : v_i\}_{i=0}^{k}, temp, exe \rangle$ is a condition to be evaluated, where $l_i \in L$, $p_i \in P$, $v_i \in V$, $temp \in \{hot, cold\}$ and $exe \in \{executed, monitored\}$ are as specified above. The condition $exp \in Exp$ is a first-order logic formula using the usual operators ($\lor, \land, \to, \lnot, \exists, \forall$). The set $\{l_i : p_i : v_i\}_{i=0}^{k}$ depicts a (possibly empty) set of attribute:value pairs that participate as predicates in $exp$. Executing a condition, that is, evaluating the logical expression specified by $exp$, also has a modality (may/must) and an execution status (performed/waited for).
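The two event types defined above can be mirrored directly as record types. The following is a minimal sketch, not the representation used in our implementation: the class names `Message` and `Condition`, the helper `_check`, and the choice to keep the logical expression as a plain string are all assumptions made for illustration, and the two example events are hypothetical encodings loosely inspired by the scenario of Figure 1.

```python
from dataclasses import dataclass
from typing import Tuple

# A lifeline:property:value triple, mirroring {l_i : p_i : v_i}.
Triple = Tuple[str, str, str]


def _check(temp: str, exe: str) -> None:
    """Reject temperatures and statuses outside the two allowed values."""
    if temp not in ("hot", "cold"):
        raise ValueError(f"unknown temperature: {temp}")
    if exe not in ("executed", "monitored"):
        raise ValueError(f"unknown execution status: {exe}")


@dataclass(frozen=True)
class Message:
    """Simple event: `action` is carried from `sender` to `receiver`."""
    sender: str                    # l_1
    action: str                    # a
    receiver: str                  # l_2
    args: Tuple[Triple, ...] = ()  # attribute:value arguments of the action
    temp: str = "hot"              # "hot" (must) or "cold" (may)
    exe: str = "executed"          # "executed" or "monitored"

    def __post_init__(self) -> None:
        _check(self.temp, self.exe)


@dataclass(frozen=True)
class Condition:
    """Condition event: a logical expression over property:value triples."""
    exp: str                          # kept as a plain string (assumption)
    triples: Tuple[Triple, ...] = ()  # triples that participate in `exp`
    temp: str = "hot"
    exe: str = "executed"

    def __post_init__(self) -> None:
        _check(self.temp, self.exe)


# Hypothetical encodings of the two halves of the Figure 1 sentence.
click = Message("user", "click", "button", temp="cold", exe="monitored")
is_red = Condition("display.color = red", (("display", "color", "red"),))
```

Freezing the records makes events hashable, so a chart can store them in sets and impose the partial order separately.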
The LSC language further allows us to define more complex events by combining partially ordered sets of events with control structures. A complex event is a tuple in $\mathbb{N} \times E_{cond} \times \{\langle E_c, < \rangle\}$, where $\mathbb{N}$ is the set of non-negative integers, $E_{cond}$ is a set of conditions as described above, and each element $\langle E_c, < \rangle$ is a partially ordered set of events. This structure allows us to derive three kinds of control structures, including:
• $e = \langle 0, cond, \langle E, < \rangle \rangle$ is a conditioned execution. If $cond$ holds, $\langle E, < \rangle$ is executed.
Definition 1 (LSC) An LSC $c = \langle E, < \rangle$ is a partially ordered set of events, where every event in $E$ is a simple event, a condition or a complex event as defined above.
Grounded Semantics: The information represented in the LSC provides the recipe for a rigorous construction of the code-base that will implement the program. This code-base is said to ground the semantic representation. For example, if our target programming language is an object-oriented programming language such as Java, then the code-base will include the objects, the methods and the properties that are minimally required for executing the scenario that is represented by the LSC. We refer to this code-base as a system model (henceforth, SM), and define it as follows.
Definition 2: (SM) Let $L_m$ be a set of implemented objects, $A_m$ a set of implemented methods, $P_m$ a set of arguments and $V_m$ argument values. We further define the auxiliary functions $methods : A_m \to L_m$, $props : P_m \to L_m$ and $values : V_m \to L_m \times P_m$, for identifying the entity $l \in L_m$ that implements the method $a \in A_m$, the entity $l \in L_m$ that contains the property $p \in P_m$, and the entity property $\langle l, p \rangle \in L_m \times P_m$ that assumes the value $v \in V_m$, respectively. A system model (SM) is a tuple $m$ representing the implemented architecture:
$$m = \langle L_m, A_m, P_m, V_m, methods, props, values \rangle$$
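To make Definition 2 concrete, the sketch below stores the three auxiliary functions as dictionaries and builds the SM of the Figure 1 scenario. It is an illustration under our own assumptions: the class name `SystemModel`, the method `is_consistent`, and the value entry `RED` are hypothetical, and keying methods and properties by bare name means that a name can belong to one object only.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class SystemModel:
    """A system model; A_m, P_m and V_m are the keys of the three maps."""
    objects: Set[str] = field(default_factory=set)                    # L_m
    methods: Dict[str, str] = field(default_factory=dict)             # method -> object
    props: Dict[str, str] = field(default_factory=dict)               # property -> object
    values: Dict[str, Tuple[str, str]] = field(default_factory=dict)  # value -> (object, property)

    def is_consistent(self) -> bool:
        """Check that every auxiliary function maps into declared elements."""
        owners = list(self.methods.values()) + list(self.props.values())
        if any(owner not in self.objects for owner in owners):
            return False
        # Each value must point at a property that its object really owns.
        return all(self.props.get(prop) == obj for obj, prop in self.values.values())


# The SM described for Figure 1: classes USER, BUTTON, DISPLAY,
# the method BUTTON.CLICK() and the property DISPLAY.COLOR.
figure1_sm = SystemModel(
    objects={"USER", "BUTTON", "DISPLAY"},
    methods={"CLICK": "BUTTON"},
    props={"COLOR": "DISPLAY"},
    values={"RED": ("DISPLAY", "COLOR")},  # illustrative value entry
)
```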
Analogously to interpretation functions in logic and natural language semantics, we assume here an implementation function, denoted [[.]], which maps each formal entity in the LSC semantic representation to its instantiation in the code-base. Using this function we define a notion of grounding that captures the fact that a certain code-base permits the execution of a given LSC c.
Definition 3(a): (Grounding) Let $\mathcal{M}$ be the set of system models and let $\mathcal{C}$ be the set of LSC charts. We say that $m$ grounds $c = \langle E, < \rangle$, and write $m \rhd c$, if $\forall e \in E : m \rhd e$, where $m \rhd e$ holds when the implementation $[[\cdot]]$ of every entity, action, property and value mentioned in $e$ is present in $m$.
We have thus far defined how the semantics of a single LSC can be grounded in a single SM. In the real world, however, a requirements document typically contains multiple different requirements, but it is interpreted as a complete whole. The desired SM is then one that represents a single domain shared by all the specified requirements. Let us then assume a document $d = d_1, \ldots, d_n$ containing $n$ requirements, where $\forall i : d_i \in L_{req}$, and let $\sqcup$ be a unification operation that returns the formal unification of two SMs if such exists, and an empty SM otherwise. We define a discourse interpretation function $sem(d)$ that returns a single SM for the entire document, where different mentions across sentences may share the same reference.
The discourse interpretation function $sem$ can be as simple as unifying all individual SMs for $d_i$, and asserting that all elements that have the same name in different SMs refer to a single element in the overall SM. Or, it can be as complex as taking into account synonyms ("clicks the button" and "presses the button"), anaphora ("when the user clicks the button, it changes colour"), binding ("when the user clicks any button, this button is highlighted"), and so on. We can now define the grounding of an entire requirements document: $M$ grounds $d$ if $M \rhd sem(d)$.
In this work we assume that sem(d) is a simple discourse interpretation function, where entities, methods, properties, etc. that are referred to using the same name in different local SMs refer to a single element in the overall code-base. This simple assumption already carries a substantial amount of disambiguating information concerning individual requirements. For example, assume that we have seen a "click" method over a "button" object in sentence i. This may help us disambiguate future attachment ambiguity, favoring structures where a "button" is attached to "click" over other attachment alternatives. Our goal is then to model discourse-level context for supporting the accurate semantic analysis of individual requirements.
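The name-based assumption can be sketched in a few lines. Here an SM is flattened into a set of root-to-node paths, so that elements with the same name and position coincide automatically; this encoding, and the names `unify`, `sem_discourse`, `reading_a` and `reading_b`, are assumptions made for the example. Under this encoding unification is plain set union and never fails, which is simpler than the general operation that may return an empty SM.

```python
from functools import reduce
from typing import FrozenSet, Iterable, Tuple

Path = Tuple[str, ...]  # a node identified by its path from the root
SM = FrozenSet[Path]    # a system model flattened into a set of paths


def unify(a: SM, b: SM) -> SM:
    """Name-based unification: identical paths denote a single element."""
    return a | b


def sem_discourse(snapshots: Iterable[SM]) -> SM:
    """Simple discourse interpretation: unify all sentence-level SMs."""
    return reduce(unify, snapshots, frozenset())


# Sentence i introduced a "click" method on a "button" object.
seen = frozenset({("button",), ("button", "click")})

# Two hypothetical readings of a later, ambiguous sentence.
reading_a = frozenset({("button",), ("button", "click"), ("display",)})
reading_b = frozenset({("click",), ("click", "button"), ("display",)})

# The reading that shares more elements with the context is preferred.
preferred = max([reading_a, reading_b], key=lambda m: len(m & seen))
document_sm = sem_discourse([seen, preferred])
```

In this toy case `reading_a` reuses both elements already in the context, while `reading_b` reuses none, which is the kind of signal the discourse-level model exploits.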

Probabilistic Modeling
In this section we set out to explicitly model the requirement's context, formally captured as a document-level SM, in order to support the accurate disambiguation of the requirements' content. We first specify our probabilistic content model, a sentence-level model which is based on a probabilistic grammar augmented with compositional semantic rules. We then specify our probabilistic context model, a document-level sequence model that takes into account the content as well as the relation between SMs at different time points.

Sentence-Based Modeling
The task of our sentence-based model is to learn a function that maps each requirement sentence to its correct LSC diagram and SM snapshot. In a nutshell, we do this via a (partially lexicalized) probabilistic context-free grammar augmented with a semantic interpretation function.
More formally, given a discourse $D = d_1 \ldots d_n$ we think of each $d_i$ as having been generated by a probabilistic context-free grammar (PCFG) $G$. The syntactic analysis of $d_i$ may be ambiguous, so we first implement a syntactic analysis function $syn : L_{req} \to T_{req}$ using a probabilistic model that selects the most likely syntax tree $t$ of each $d$ individually. We can simplify $syn(d)$, with $d$ constant with respect to the maximization:
$$syn(d) = \arg\max_{t \in T_{req}} P(t \mid d) = \arg\max_{t \in T_{req}} \frac{P(t, d)}{P(d)} = \arg\max_{\{t \,:\, yield(t) = d\}} P(t)$$
Because of the context-freeness assumption, it holds that $P(t) = \prod_{r \in der(t)} P(r)$, where $der(t)$ returns the rules that derive $t$. The resulting probability distribution $P : T_{req} \to [0, 1]$ defines a probabilistic language model over all requirements $d \in L_{req}$, i.e., $\sum_{d \in L_{req}} \sum_{t \in T_{req},\, yield(t) = d} P(t) = 1$.
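As an illustration of the product over rules and of the maximization over trees with a given yield, consider the following sketch. The tuple encoding of trees, the function names and the toy grammar with its probabilities are assumptions chosen for the example; an actual parser would search the space of trees with a chart rather than enumerate candidates.

```python
import math
from typing import Dict, List, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (left-hand side, right-hand side)


def rules(tree) -> List[Rule]:
    """der(t): the rules used in the derivation of `tree`.

    A tree is either a leaf word (str) or a tuple (label, child, ...).
    """
    if isinstance(tree, str):
        return []
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    found = [(label, rhs)]
    for child in children:
        found.extend(rules(child))
    return found


def tree_yield(tree) -> List[str]:
    """yield(t): the sequence of leaf words of `tree`."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in tree_yield(child)]


def log_prob(tree, rule_logp: Dict[Rule, float]) -> float:
    """log P(t) as the sum of the log-probabilities of the rules in der(t)."""
    return sum(rule_logp[rule] for rule in rules(tree))


def syn(sentence: List[str], candidates, rule_logp: Dict[Rule, float]):
    """Most probable candidate tree whose yield equals `sentence`."""
    matching = [t for t in candidates if tree_yield(t) == sentence]
    return max(matching, key=lambda t: log_prob(t, rule_logp))


# Toy grammar (illustrative probabilities; each left-hand side sums to 1).
toy_logp = {
    ("S", ("V", "NP")): math.log(0.7),
    ("S", ("NN", "NP")): math.log(0.3),
    ("NP", ("NN",)): math.log(1.0),
    ("V", ("click",)): math.log(1.0),
    ("NN", ("click",)): math.log(0.4),
    ("NN", ("button",)): math.log(0.6),
}
verb_reading = ("S", ("V", "click"), ("NP", ("NN", "button")))
noun_reading = ("S", ("NN", "click"), ("NP", ("NN", "button")))
best = syn(["click", "button"], [verb_reading, noun_reading], toy_logp)
```

Working in log space avoids numerical underflow when derivations become long.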
We assume a function $sem : T_{req} \to \mathcal{C}$ mapping syntactic parse trees to semantic constructs in the LSC language. Syntactic parse trees are complex entities, assigning structures to the flat sequences of words. The principle of compositionality asserts that the meaning of a complex syntactic entity is a function of the meaning of its parts and their mode of combination. Here, the semantics of a tree $t \in T_{req}$ is derived compositionally from the interpretation of the rules in the grammar $G$. We overload the $sem$ notation to define $sem : R \to \mathcal{C}$ as a function assigning rules to LSC constructs (events or parts of events), with $\circ$ merging the resulting sets of events. Our sentence-based compositional semantics is summarized as:
$$sem(u) = sem(syn(u)) = sem(r_1 \circ \ldots \circ r_k) = sem(r_1) \circ \ldots \circ sem(r_k)$$
Here, it suffices to say that $sem$ maps edges in the syntax tree to functions in the API of an existing LSC editor. For example: $sem(NP \to DET\ NN) = f_{CreateObject}(DET.sem, NN.sem)$. We specify the function $sem$ in the supplementary materials. The code of $sem$ is available as part of PlayGo (www.playgo.co).
For a single chart $c$, one can easily construct an implementation for every entity, action and property in the chart. Then, by design, we get an SM $m$ such that $m \rhd c$. To construct the SM of the entire discourse in the sentence-based model we simply return $f(d_1, \ldots, d_n) = \bigsqcup_{i=1}^{n} m_i$, where $\forall i : m_i \rhd sem(syn(d_i))$ and $\sqcup$ unifies different mentions of the same string to a single element.

Discourse-Based Modeling
We assume a given document $D \in \mathcal{D}$ and aim to find the most probable system model $M \in \mathcal{M}$ that satisfies the requirements. We assume that $M$ reflects a single domain that the stakeholders have in mind, and that we are provided with ambiguous natural language evidence, an elicited discourse $D$, in which they convey it. We instantiate this view as a noisy channel model (Shannon, 1948), which provides the foundation for many NLP applications, such as speech recognition (Bahl et al., 1983) and machine translation (Brown et al., 1993).
According to the noisy channel model, when a signal is received it does not uniquely identify the message being sent. A probabilistic model is then used to decode the original message. In our case, the signal is the discourse and the message is the overall system model. In formal terms, we want to find a model $M$ that maximises the following:
$$M^* = \arg\max_{M \in \mathcal{M}} P(M \mid D)$$
We can simplify further, using Bayes' law, where $D$ is constant with respect to the maximisation:
$$M^* = \arg\max_{M \in \mathcal{M}} \frac{P(D \mid M)\,P(M)}{P(D)} = \arg\max_{M \in \mathcal{M}} P(D \mid M)\,P(M)$$
We would thus like to estimate two types of probability distributions, $P(M)$ over the source and $P(D \mid M)$ over the channel. Both $M$ and $D$ are structured objects with complex internal structure. In order to assign probabilities to objects involving such complex structures it is customary to break them down into simpler, more basic, events. We know that $D = d_1, d_2, \ldots, d_n$ is composed of $n$ individual sentences, each representing a certain aspect of the model $M$. We assume a sequence of snapshots of $M$ that correspond to the timestamps $1 \ldots n$, that is: $m_1, m_2, \ldots, m_n \in \mathcal{M}$ where $\forall i : m_i \rhd sem(d_i)$. The complete SM is given by the union of the different snapshots reflected in different requirements, i.e., $M = \bigcup_i m_i$. We then rephrase:
$$P(M) = P(m_1, \ldots, m_n)$$
$$P(D \mid M) = P(d_1, \ldots, d_n \mid m_1, \ldots, m_n)$$
These events may be seen as points in a high dimensional space. In actuality, they are too complex and would be too hard to estimate directly. We then define two independence assumptions. First, we assume that a system model snapshot at time $i$ depends only on $k$ previous snapshots (a stationary distribution). Secondly, we assume that each sentence $i$ depends only on the SM snapshot at time $i$. We now get:
$$P(m_1, \ldots, m_n) \approx \prod_{i=1}^{n} P(m_i \mid m_{i-k}, \ldots, m_{i-1})$$
$$P(d_1, \ldots, d_n \mid m_1, \ldots, m_n) \approx \prod_{i=1}^{n} P(d_i \mid m_i)$$
Furthermore, assuming bi-gram transitions, our objective function is now represented as follows:
$$M^* = \arg\max_{M \in \mathcal{M}} \prod_{i=1}^{n} P(m_i \mid m_{i-1})\,P(d_i \mid m_i)$$
Note that $m_0$ may be empty if the system is implemented from scratch, and non-empty if the requirements assume an existing code-base, which makes $P(m_1 \mid m_0)$ a non-trivial transition.

Training and Decoding
Our model is in essence a Hidden Markov Model in which states capture SM snapshots, state-transition probabilities model transitions between SM snapshots, and emission probabilities model the verbal description of each state. To implement this, we need to implement a decoding algorithm that searches through all possible state sequences, and a training algorithm that can automatically learn the values of the still rather complex parameters.
Training: We assume a supervised training setting in which we are given a set of examples annotated by a human expert. For instance, these can be requirements an analyst has formulated and encoded using an LSC editor, while manually providing disambiguating information. We are provided with a set of pairs $\{\langle D_i, M_i \rangle\}_{i=1}^{n}$ containing $n$ documents, where each of the pairs in $i = 1..n$ is a tuple set $\{\langle d_{ij}, t_{ij}, c_{ij}, m_{ij} \rangle\}_{j=1}^{n_i}$. For all $i, j$, it holds that $t_{ij} = syn(d_{ij})$, $c_{ij} = sem(t_{ij})$, and $m_{ij} \rhd sem(syn(d_{ij}))$. The union of the $n_i$ SM snapshots yields the entire model $\bigcup_j m_{ij} = M_i$, which satisfies the set of requirements: $M_i \rhd sem(d_{i1}, \ldots, d_{in_i})$.
(i) Emission Parameters. Our emission parameters $P(d_i \mid m_i)$ represent the probability of a verbal description of a requirement given an SM snapshot which grounds the semantics of the description. A single SM may result from different syntactic derivations. We calculate this probability by marginalizing over the syntactic trees that are grounded in the same SM snapshot.
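One way to write this marginalization out is given below. It is a sketch under our own notational assumptions: $\rhd$ denotes grounding, the sum ranges over the trees whose yield is $d$ and whose semantics is grounded in $m$, and any normalization over the descriptions that share the snapshot $m$ is left implicit.

$$\hat{P}(d \mid m) \;\propto\; \sum_{\{t \in T_{req} \,:\, yield(t) = d,\; m \rhd sem(t)\}} P(t)$$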
The probability $P(t)$ is estimated using a treebank PCFG (Charniak, 1996), based on all pairs $\langle d_{ij}, t_{ij} \rangle$ in the annotated corpus. We estimate rule probabilities using maximum-likelihood estimates, and use simple smoothing for unknown lexical items, using rare-words distributions.
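A treebank PCFG of this kind can be sketched with relative-frequency counts. The handling of rare words shown here, replacing every word seen at most `rare_threshold` times with a placeholder token before counting, is one common reading of "rare-words distributions" and is an assumption on our part; the token `<UNK>`, the function names and the toy treebank are likewise hypothetical.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical placeholder for rare and unknown words


def leaves(tree):
    """The words of a tree; a tree is a str leaf or (label, child, ...)."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]


def replace_rare(tree, rare):
    """Copy `tree`, mapping every word in `rare` to the UNK token."""
    if isinstance(tree, str):
        return UNK if tree in rare else tree
    return (tree[0],) + tuple(replace_rare(child, rare) for child in tree[1:])


def rules(tree):
    """All (lhs, rhs) rules used in the derivation of `tree`."""
    if isinstance(tree, str):
        return []
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    found = [(label, rhs)]
    for child in children:
        found.extend(rules(child))
    return found


def estimate_pcfg(treebank, rare_threshold=1):
    """Maximum-likelihood rule probabilities: count(rule) / count(lhs)."""
    word_counts = Counter(w for tree in treebank for w in leaves(tree))
    rare = {w for w, c in word_counts.items() if c <= rare_threshold}
    rule_counts, lhs_counts = Counter(), Counter()
    for tree in treebank:
        for lhs, rhs in rules(replace_rare(tree, rare)):
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


toy_treebank = [
    ("S", ("V", "click"), ("NP", ("NN", "button"))),
    ("S", ("V", "click"), ("NP", ("NN", "display"))),
    ("S", ("V", "press"), ("NP", ("NN", "button"))),
]
pcfg = estimate_pcfg(toy_treebank)
```

At parsing time, a word that was never observed would be mapped to the same placeholder, so that it receives the probability mass collected from rare training words.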
(ii) Transition Parameters. Our transition parameters $P(m_i \mid m_{i-1})$ represent the amount of overlap between the SM snapshots. We look at the current and the previous system model, and aim to estimate how likely the current SM is given the previous one. There are different assumptions that may underlie this probability distribution, reflecting different principles of human communication.
We first define a generic estimator as follows:
$$\hat{P}(m_i \mid m_{i-1}) = \frac{gap(m_i, m_{i-1})}{\sum_{m_j \in \mathcal{M}} gap(m_j, m_{i-1})}$$
where $gap(\cdot)$ quantifies the information sharing between SM snapshots. Regardless of the implementation of $gap$, it can be easily shown that $\hat{P}$ is a conditional probability distribution where $\hat{P} : \mathcal{M} \times \mathcal{M} \to [0, 1]$ and, for all $m_j$: $\sum_{m_i} \hat{P}(m_i \mid m_j) = 1$. (For efficiency reasons, we consider $\mathcal{M}$ to be a restricted universe that is considered by the decoder, as specified shortly.)
We define different $gap$ implementations, reflecting different assumptions about the discourse. Our first assumption here is that different SM snapshots refer to the same conceptual world, so there should be a large overlap between them. We call this the max-overlap assumption. A second assumption is that, in collaborative communication, a new requirement will only be stated if it provides new information, akin to Grice (1975). This is the max-expansion assumption. An additional assumption prefers "easy" transitions over "hard" ones; this is the min-distance assumption. The different gap calculations are listed in Table 1.
Table 1: Quantifying the gap between snapshots. $set(m_i)$ is a set of nodes marked by path to root.
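The generic estimator can be sketched as a normalization of gap scores over a small universe of candidate snapshots. The two gap functions below are our own illustrative stand-ins for the max-overlap and max-expansion assumptions (size of the shared node set, and size of the newly added node set); they are assumptions for this example and are not the calculations of Table 1. The small constant `eps`, which keeps the normalizer positive, is likewise an assumption.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Path = Tuple[str, ...]  # a node marked by its path to the root
SM = FrozenSet[Path]    # set(m): the node set of a snapshot


def overlap_gap(m_i: SM, m_prev: SM) -> float:
    """Illustrative max-overlap score: number of shared nodes."""
    return float(len(m_i & m_prev))


def expansion_gap(m_i: SM, m_prev: SM) -> float:
    """Illustrative max-expansion score: number of newly introduced nodes."""
    return float(len(m_i - m_prev))


def transition_probs(
    m_prev: SM,
    universe: Sequence[SM],
    gap: Callable[[SM, SM], float],
    eps: float = 1e-6,
) -> List[float]:
    """Normalize gap scores over `universe` into P(m | m_prev)."""
    scores = [gap(m, m_prev) + eps for m in universe]
    total = sum(scores)
    return [score / total for score in scores]


prev = frozenset({("button",), ("button", "click")})
cand_a = prev | {("display",)}                   # reuses the context, adds one node
cand_b = frozenset({("click",), ("display",)})   # shares nothing with the context
p_overlap = transition_probs(prev, [cand_a, cand_b], overlap_gap)
p_expand = transition_probs(prev, [cand_a, cand_b], expansion_gap)
```

The two assumptions pull in different directions on this toy input: overlap favors `cand_a`, expansion favors `cand_b`.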
Decoding. An input document contains $n$ requirements. Our decoding algorithm considers the N-best syntactic analyses for each requirement. At each time step $i = 1 \ldots n$ we assume $N$ states representing the semantics of the $N$ best syntax trees, retrieved via a CKY chart parser. Thus, setting $N = 1$ is equivalent to a sentence-based model, in which for each sentence we simply select the most likely tree according to a probabilistic grammar, and construct a semantic representation for it.
For each document of length $n$, we assume that our entire universe of system models $\mathcal{M}$ is composed of $N \times n$ SM snapshots, reflecting the $N$ most-likely analyses of $n$ sentences, as provided by the probabilistic syntactic model. (As shall be seen shortly, even with this restricted universe approximating $\mathcal{M}$, our discourse-based model provides substantial improvements over a sentence-based model.)
Our discourse-based model is an HMM where each requirement is an observed signal, and each $i = 1..N$ is a state representing the SM that grounds the $i$-th best tree. Because of the Markov independence assumption our setup satisfies the optimal substructure and overlapping subproblems properties, and we can use efficient Viterbi decoding to exhaustively search through the different state sequences, and find the most probable sequence that has generated the sequence of requirements according to our discourse-based probabilistic model.
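The decoding step can be sketched as a standard log-space Viterbi pass over the grid of N-best states. This is an illustration under simplifying assumptions: the function name `viterbi`, its argument layout and the toy probabilities are ours, states at step $i$ are restricted to the N-best analyses of sentence $i$, and the transition function is supplied by the caller.

```python
import math
from typing import Callable, List, Sequence


def viterbi(
    emission_logp: Sequence[Sequence[float]],
    transition_logp: Callable[[int, int, int], float],
    initial_logp: Sequence[float],
) -> List[int]:
    """Most probable state sequence through an n x N grid.

    emission_logp[i][s]      : log P(d_i | state s at step i)
    transition_logp(i, r, s) : log P(state s at step i | state r at step i-1)
    initial_logp[s]          : log P(state s at the first step | m_0)
    """
    scores = [initial_logp[s] + e for s, e in enumerate(emission_logp[0])]
    backpointers = []
    for i in range(1, len(emission_logp)):
        new_scores, pointers = [], []
        for s, emit in enumerate(emission_logp[i]):
            # Best predecessor for state s at step i.
            prev = max(
                range(len(scores)),
                key=lambda r: scores[r] + transition_logp(i, r, s),
            )
            new_scores.append(scores[prev] + transition_logp(i, prev, s) + emit)
            pointers.append(prev)
        scores = new_scores
        backpointers.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(range(len(scores)), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy document: n = 2 sentences, N = 2 analyses each (illustrative numbers).
log = math.log
emissions = [[log(0.6), log(0.4)], [log(0.6), log(0.4)]]
trans = [[0.1, 0.9], [0.5, 0.5]]
initial = [log(0.5), log(0.5)]

best_path = viterbi(emissions, lambda i, r, s: log(trans[r][s]), initial)
sentence_based = [max(range(len(e)), key=lambda s: e[s]) for e in emissions]
```

In this toy grid the sentence-based choice picks the top-ranked tree at every step, while the decoded path switches to the second-ranked analysis of the second sentence because the transition from the first state strongly favors it.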
The overall complexity of decoding a document with $n$ sentences of maximal length $l$, using a grammar $G$ of size $|G|$ and a fixed $N$, is given by:
$$O(n \times l^3 \times |G|^3 + l^2 \times N^2 \times n + N^2 \times n^3)$$
We can break this expression down as follows: (i) In $O(n \times l^3 \times |G|^3)$ we generate $N$ best trees for each one of the $n$ requirements, using a CKY chart (Younger, 1967). (ii) In $O(l^2 \times N^2 \times n)$ we create the universe $\mathcal{M}$ based on the $N$ best trees of the $n$ requirements, and calculate $N \times N$ transitions. (iii) In $O((N \times n)^2 \times n) = O(N^2 \times n^3)$ we decode the $n \times N$ grid using Viterbi (1967) decoding.

Experiments
Goal. We set out to evaluate the accuracy of a semantic parser for requirements documents, in the two modes of analysis presented above. Our evaluation methodology is as standardly assumed in machine learning and NLP: given a set of annotated examples (that is, a set of requirements documents, where each requirement is annotated with its correct LSC representation and each document is associated with a complete SM), we partition this set into a training set and a test set that are disjoint. We train our statistical model on the examples in the training set and automatically analyze the requirements in the test set. We then compare the predicted semantic analyses of the test set with the human-annotated (henceforth, gold) semantic analyses of this test set, and empirically quantify our prediction accuracy.
Metrics. Our semantic LSC objects have the form of a tree (reflecting the sequence of nested events in our scenarios). Therefore, we can use standard tree evaluation metrics, such as ParseEval (Black et al., 1992), to evaluate the accuracy of the prediction. Overall, we define three metrics to evaluate the accuracy of the LSC trees: POS: the POS metric is the percentage of part-of-speech tags predicted correctly. LSC-F1: F1 is the harmonic mean of the precision and recall of the predicted tree. LSC-EM: EM is 1 if the predicted tree is an exact match to the gold tree, and 0 otherwise.
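The F1 and exact-match scores can be sketched over labeled spans, in the spirit of ParseEval. This is a simplified illustration: the tuple encoding of trees and the function names are assumptions, and every labeled node is counted, including the nodes directly above the leaves, which standard ParseEval implementations usually score separately.

```python
from collections import Counter


def brackets(tree, start=0):
    """Return (multiset of (label, start, end) spans, end position).

    A tree is either a leaf (str) or a tuple (label, child, child, ...).
    """
    if isinstance(tree, str):
        return Counter(), start + 1
    spans = Counter()
    position = start
    for child in tree[1:]:
        child_spans, position = brackets(child, position)
        spans += child_spans
    spans[(tree[0], start, position)] += 1
    return spans, position


def f1_and_em(predicted, gold):
    """Labeled-span F1 and exact match (1 or 0) of `predicted` against `gold`."""
    exact = int(predicted == gold)
    pred_spans, _ = brackets(predicted)
    gold_spans, _ = brackets(gold)
    matched = sum((pred_spans & gold_spans).values())
    if matched == 0:
        return 0.0, exact
    precision = matched / sum(pred_spans.values())
    recall = matched / sum(gold_spans.values())
    return 2 * precision * recall / (precision + recall), exact


gold_tree = ("S", ("V", "click"), ("NP", ("NN", "button")))
pred_tree = ("S", ("NN", "click"), ("NP", ("NN", "button")))
f1, em = f1_and_em(pred_tree, gold_tree)
```

Here three of the four labeled spans agree, so precision and recall are both 0.75 and the trees are not an exact match.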
In the case of SM trees, as opposed to the LSC trees, we cannot assume identity of the yield between the gold and parse trees for the same sentence, so we cannot use ParseEval. Therefore, we implement a distance-based metric in the spirit of Tsarfaty et al. (2012). Then, to evaluate the accuracy of the SM, we use two kinds of scores: SM-TED: TED is the normalized edit distance between the predicted and gold SM trees, subtracted from unity. SM-EM: EM is 1 if the predicted SM is an exact match with the gold SM, 0 otherwise.
Table 3: Sentence-Based modeling: Accuracy results on the Phone development set.

Data.
We have a small seed of correctly annotated requirements-specification case studies that describe simple reactive systems in the LSC language. Each document contains a sequence of requirements, each of which is annotated with the correct LSC diagram. The entire program is grounded in a Java implementation. As training data, we use the case studies provided by . Table 2 lists the case studies and basic statistics concerning these data. As our annotated seed is quite small, it is hard to generalize from it to unseen examples. In particular, we are not guaranteed to have observed all possible structures that are theoretically permitted by the assumed grammar. To cope with this, we create a synthetic set of examples using the grammar of this English fragment in generation mode, and randomly generate trees $t \in T_{req}$.
The grammar we use to generate the synthetic examples clearly over-generates. That is to say, it creates many trees that do not have a sound interpretation. In fact, only 3000 out of 10000 generated examples turn out to have a sound semantic interpretation grounded in an SM. Nonetheless, these data allow us to smooth the syntactic distributions that are observed in the seed, and increase the coverage of the grammar learned from it.
In our next experiment, we provide empirical upper bounds and lower bounds for the discourse-based model. Table 4 presents the results of the discourse-based model for N > 1 on the Phone example. Gen-Only presents the results of the discourse-based model with a PCFG learned from synthetic trees only, incorporating transitions obeying the max-overlap assumption. Already here, we see a mild improvement for N > 1 relative to the N = 1 results, indicating that even a weak signal such as the overlap between neighboring sentences already improves sentence disambiguation in context. We next present the results of an Oracle experiment, where every requirement is assigned the highest scoring tree in terms of LSC-F1 with respect to the gold tree, keeping the same transitions. Again we see that results improve with N, indicating that the syntactic model alone does not provide optimal disambiguation. These results provide an upper bound on the parser performance for each N. Gen+Seed presents results of the discourse-based model where the PCFG interpolates the seed set and the synthetic train set, with max-overlap transitions. Here, we see larger improvements over the synthetic-only PCFG. That is, modeling grammaticality of individual sentences improves the interpretation of the document.

Results.
Next we compare the performance for different implementations of the $gap(m_i, m_j)$ function. We estimate probability distributions that reflect each of the assumptions we discussed, and add an additional method called hybrid, in which we interpolate the max-expansion and max-overlap estimates (equal weights). In Table 5, the trend from the previous experiment persists. Notably, the hybrid model provides a larger error reduction than its components used separately, indicating that in order to capture discourse context we may need to balance possibly conflicting factors. In no emissions we rely solely on the probability of state transitions, and again increasing N leads to improvement. This result confirms that context is indispensable for sentence interpretation, even when probabilities for the sentence's seman-
We finally perform a cross-fold experiment in which we leave one document out as a test set and take the rest as our seed. The results are provided in Table 6. The discourse-based model outperforms the sentence-based model (N = 1) in all cases. Moreover, the drop in N = 128 for Phone seems incidental to this set, and the other cases level off beforehand. Despite our small seed, the persistent improvement on all metrics is consistent with our hypothesis that modeling the interpretation process within the discourse has substantial benefits for automatic understanding of the text.

Applications and Discussion
The statistical models we present here are applied in the context of PlayGo, a comprehensive tool for behavioral, scenario-based, programming. PlayGo now provides two modes of playing-in natural language requirements: interactive play-in, where a user manually disambiguates the analyses of the requirements, and statistical play-in, where disambiguation decisions are taken using our discourse-based model.
The fragment of English we use is very expressive. It covers not only entities and predicates, but also temporal and aspectual information, modalities, and program flow. Beyond that, we assume an open-ended lexicon. Overall, we are not only translating English sentences into executable LSCs; we provide a fully generative model for translating a complete document (text) into a complete system model (code).
This text-to-code problem may be thought of as a machine translation (MT) problem, where one aims to translate sentences in English to the formal language of LSCs. However, standard statistical MT techniques rely on the assumption that textual requirements and code are aligned at a sentence level. Creating a formal model that aligns text and code on a sentence-by-sentence basis is precisely our technical contribution in Section 3.
To our knowledge, modeling syntax and discourse processing via a fully joint generative model, where a discourse-level HMM is interleaved with PCFG sentence-based emissions, is novel. By plugging in different models for $P(d \mid m)$, different languages may be parsed. This method may further be utilized for relating content and context in other tasks: parsing and document-level NER, parsing and document-level IE, etc. To do so, one only needs to redefine the PCFG (emissions) and state-overlap (transition) parameters, as appropriate for their data.

Conclusion
The requirements understanding task presents an exciting challenge for CL/NLP. We ought to automatically discover the entities in the discourse, the actions they take, conditions, temporal constraints, and execution modalities. Furthermore, it requires us to extract a single ontology that satisfies all individual requirements. The contributions of this paper are three-fold: we formalize the text-to-code prediction task, propose a semantic representation with well-defined grounding, and empirically evaluate models for this prediction. We show consistent improvement of discourse-based over sentence-based models, in all case studies. In the future, we intend to extend this model for interpreting requirements in unrestricted, or less restricted, English, endowed with a more sophisticated discourse interpretation function.