Proof Net Structure for Neural Lambek Categorial Parsing

In this paper, we present the first statistical parser for Lambek categorial grammar (LCG), a grammatical formalism for which the graphical proof method known as *proof nets* is applicable. Our parser incorporates proof net structure and constraints into a system based on self-attention networks via novel model elements. Our experiments on an English LCG corpus show that incorporating term graph structure is helpful to the model, improving both parsing accuracy and coverage. Moreover, we derive novel loss functions by expressing proof net constraints as differentiable functions of our model output, enabling us to train our parser without ground-truth derivations.


Introduction
In the family of categorial grammars, combinatory categorial grammar (CCG) has received by far the most attention in the computational linguistics literature. There exist algorithms for both mildly context-sensitive (e.g., Kuhlmann and Satta, 2014) and context-free (typically CKY; Cocke and Schwartz, 1970; Kasami, 1966; Younger, 1967) CCG parsing, and there has been much research on statistical CCG parsers (e.g., Clark and Curran, 2007; Lewis et al., 2016; Stanojević and Steedman, 2020). Another member of the categorial family, Lambek categorial grammar (LCG), has been less well-explored: LCG work has been primarily theoretical or focused on non-statistical parsing.
The recent lack of attention is likely due to two notable results: (1) LCG is weakly context-free equivalent (Pentus, 1997); and (2) LCG parsing is NP-complete (Pentus, 2006; Savateev, 2012). However, neither of these issues is necessarily practically relevant. Moreover, LCG presents a number of advantages and interesting properties. For example, LCG provides even greater syntax-semantics transparency than is the case for most CCG parsers because it does not invoke non-categorial rules, maintaining a consistent parsing framework. LCG's rules together define a calculus over syntactic categories that is a subset of linear logic (Girard, 1987). LCG, like CCG or LTAG, is a highly lexicalized formalism: lexical categories encode substantial syntactic information, and as a result are themselves complex and structured. Despite this, the inner structure of the categories has not been strongly considered in parsers beyond evaluating the category for compatibility with a grammatical rule.
In this paper, we present the first statistical LCG parser. Unlike past parsers for CCG or LTAG, our parser explicitly incorporates structural aspects of the grammar. We base our system on proof nets, a graphical method for representing linear logic proofs that abstracts over irrelevant aspects, such as the order of application of logical rules (Girard, 1987; Roorda, 1992). Such order-of-application differences are precisely what give rise to spurious ambiguity, making proof nets an attractive choice for representing derivations.
Our work has two primary contributions. First, we introduce a self-attention-based LCG parsing model that incorporates proof net structure in multiple ways. We find that attending to proof net structure enables us to define a model that is differentiable through this categorial structure down to the atomic categories of the grammar, improving parsing accuracy and coverage on an English LCG corpus.
Second, proof net constraints allow us to define novel grammatico-structural loss functions that can be used as training objectives. This enables us to train a parser without ground-truth derivations that has high coverage and even frequently includes the correct parse among the parses that it finds. Our analysis shows that all of our components contribute to the parser's performance, but that planarity information is especially important.

Figure 1: The rules of the associative Lambek calculus without product and allowing empty premises.

Lambek categorial grammar
Lexical categories in LCG, like those of CCG, come from an infinite set of categories formed by the closure of two binary connectives, the forward (/) and backward (\) slash, over a small set of atomic (i.e., primitive) categories, such as S and NP for sentences and noun phrases. Both connectives create functional categories; they differ in the side on which the specified argument must appear. For example, (S\NP)/NP/NP represents a category that combines with two NPs to its right and one NP to its left to yield a valid S.¹ In English, this category might represent a ditransitive verb. Figure 1 shows the rules of inference for L*, the associative Lambek calculus without product and allowing empty premises. In L*, statements, called sequents, have (ordered) lists of categories as antecedents on the left of the turnstile and single categories as consequents on the right. The interpretation of a sequent is that its consequent can be derived from its antecedents. The rules have their premises above a bar, conclusions below, and a label for the rule to the right. Rules /e and \e eliminate a slashed category, in that it is missing from their conclusions; rules /i and \i introduce a new slashed functor in the consequent of their conclusions.
Each rule states that its concluding sequent is true (derivable) if and only if all of its premises are. α and β are variables over categories (atomic or complex), while Δ and Γ are variables over possibly-empty² lists of categories. A typical application of LCG is to look up a category for each word in a sentence and then inquire whether the consequent S is derivable from the antecedent that lists these categories in the same order as their words.
While some of CCG's rules are not derivable in the Lambek calculus (inter alia, crossing composition and substitution), first-degree harmonic composition and type-raising are. At the same time, LCG's introduction rules cannot be derived by any CCG with finitely many rules (Zielonka, 1981). Although LCG parsing is known to be an NP-complete problem (Pentus, 2006; Savateev, 2012), Fowler (2010) presented an algorithm that is exponential only in category order, a quantity that is bounded to small values in practice (Fowler, 2016).

¹Although LCG's usual notation employs these connectives slightly differently, we use CCG notation here.
²There are calculus variants that disallow such empty lists.

Term graphs: enhanced proof nets
Our work in this paper is based on a variety of proof net known as term graphs. A term graph is a digraph that represents a sequent proof in the Lambek calculus. The atoms of the sequent correspond to vertices in the graph, and the internal structure of the lexical categories is represented by regular edges and Lambek edges between the vertices. Together, the vertices, regular edges, and Lambek edges are referred to as a proof frame, which is invariant across possible proofs for the sequent. A proof is represented by a proof frame plus an additional set of regular edges between the vertices called a linkage. Different linkages correspond to different proofs, which in turn correspond to different syntactic parses. For a term graph to be valid, the frame-plus-linkage is subject to certain conditions, detailed below.³

To construct a proof frame for a sequent C_1, C_2, ..., C_n ⊢ C, the categories in the sequent are first assigned positive or negative polarities. Each lexical category of the antecedent is marked negative (−), while the consequent is marked positive (+). Each polarized category is decomposed into its polarized atoms according to a set of recursively-applied lexical decomposition rules; these rules also specify the regular and Lambek edges between the atoms, represented as solid and dashed edges, respectively. The total order of the frame (indicated left-to-right) is determined by the ordering of the lexical categories in the sequent together with the ordering specified in the decomposition rules.

A linkage for a proof frame consists of directed edges called links from positive vertices to negative vertices of the same atomic category. Valid linkages form perfect matchings: each vertex in the frame has exactly one link, and that link is outgoing for positive vertices and incoming for negative vertices. A term graph represents a proof in L* (Fowler, 2009), and therefore also an LCG parse, so long as it meets the following conditions:

T1. The linkage is half-planar; i.e., the links can be drawn above the linearly-ordered vertices without crossing.
T2. Treating links as regular edges, the graph is regular-acyclic; i.e., there are no cycles containing only regular edges.
T3. For each Lambek edge ⟨u, v⟩, there exists a regular path from u to v.

A term graph that satisfies these conditions is called L*-integral. Figure 2 shows an example term graph.

³See Fowler (2009, 2016) for full details.

Figure 2: An example term graph for the sentence beginning "What accounts for the ...", with lexical categories S/(S\NP), (S\NP)/PP, PP/NP, and NP/N. Dotted vertical lines delimit polarized atoms within a word; solid vertical lines mark lexical boundaries. The linkage is shown above the atoms; the proof frame edges are shown below them, with solid regular edges and dashed Lambek edges. The consequent category is aligned with sentence-final punctuation for convenience. This single term graph represents multiple spuriously ambiguous derivations.
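Conditions T1-T3 can be checked directly on a candidate term graph. The following sketch is illustrative only (it is not the authors' implementation): vertices are indexed 0..n−1 in their linear order, edges and links are pairs of indices, and links are given as (positive, negative) pairs.

```python
from itertools import combinations

def spans_cross(link1, link2):
    """Two links drawn as arcs above the vertex line cross iff their
    endpoint spans interleave."""
    a, b = sorted(link1)
    c, d = sorted(link2)
    return a < c < b < d or c < a < d < b

def has_path(edges, src, dst, n):
    """Reachability from src to dst over directed edges, via DFS."""
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v)
    stack, seen = [src], set()
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        if u in seen:
            continue
        seen.add(u)
        stack.extend(adj[u])
    return False

def is_integral(n, regular_edges, lambek_edges, linkage):
    """Check conditions T1-T3 for a candidate term graph."""
    # T1: the linkage is half-planar (no two links cross).
    if any(spans_cross(l1, l2) for l1, l2 in combinations(linkage, 2)):
        return False
    all_regular = list(regular_edges) + list(linkage)
    # T2: regular-acyclic; a cycle exists iff some edge's head reaches its tail.
    for u, v in all_regular:
        if has_path(all_regular, v, u, n):
            return False
    # T3: every Lambek edge <u, v> is witnessed by a regular path u -> v.
    return all(has_path(all_regular, u, v, n) for u, v in lambek_edges)
```

A production implementation would check these conditions incrementally rather than per candidate graph, but the logic is the same.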

Neural network LCG parsing
For categorial and other highly-lexicalized grammatical formalisms, the standard approach to statistical syntactic parsing separates the problem into two steps: (1) a supertagger assigns lexical categories to the words in the input sentence; then (2) a parser uses the supertagger's predictions to produce a predicted parse for the sentence. With proof nets, the lexical categories uniquely determine the proof frame, so supertagging can be seen as predicting a proof frame for the sentence. The second step then corresponds to predicting the linkage for the proof frame. Of course, the linkage must, together with the proof frame, yield an L*-integral term graph.
Our work in this paper focuses on the latter component. Our parser is constructed such that we can separate the aspects that incorporate term graph structure and constraints from a "baseline" model that uses almost no such information. To provide a broad overview: our baseline model runs a Transformer encoder stack (Vaswani et al., 2017) over the proof frame vertices. The top encoder block is truncated, omitting everything after and including the softmax, so that its attention scores directly serve as scores for every candidate link; we mask these scores so that only valid links (i.e., those from positive vertices to negative vertices of the same atomic category) are considered.
We next detail our baseline model in Section 3.1 and then our various methods for incorporating term graph structure in Sections 3.2-3.4. Note that we use named tensor notation (Chiang et al., 2021) in the mathematical descriptions.

Parser inputs
Our baseline model takes as input a sentence, an associated proof frame, and an alignment between the words in the sentence and the polarized atoms that are the proof frame's vertices. We represent each word as a vector of size |vec|, so that a sentence of length |words| is represented as a matrix W ∈ R^{words × vec}. For a grammar with a set T of atomic categories, the set PT of polarized atoms pairs each atomic category with each polarity; the proof frame vertices are thus each represented as one-hot vectors of width |pt| = |PT| = 2|T|. A proof frame with |vtx| vertices is represented by stacking its vertices' vectors, forming a matrix U ∈ {0, 1}^{vtx × pt} whose rows each sum to 1 over the pt axis. The vtx axis is ordered according to the vertices' total order. Finally, the word-vertex alignment is represented as a matrix M ∈ {0, 1}^{vtx × words} where M_{vtx(i), words(j)} = 1 if and only if vertex i corresponds to word j.
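These input matrices are straightforward to construct. A toy sketch (the atom inventory and helper name are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical tiny inventory: 2 atomic categories x 2 polarities.
PT = ["S+", "S-", "NP+", "NP-"]

def encode_frame(vertex_atoms, vertex_words, n_words):
    """Build the one-hot vertex matrix U (|vtx| x |pt|) and the word-vertex
    alignment matrix M (|vtx| x |words|) for a proof frame."""
    U = np.zeros((len(vertex_atoms), len(PT)))
    M = np.zeros((len(vertex_atoms), n_words))
    for i, (atom, word) in enumerate(zip(vertex_atoms, vertex_words)):
        U[i, PT.index(atom)] = 1.0  # one-hot polarized atom
        M[i, word] = 1.0            # vertex i belongs to word `word`
    return U, M

# Vertices in their total order, with the word each one came from.
U, M = encode_frame(["NP-", "S+", "S-", "NP+"], [0, 0, 1, 1], n_words=2)
```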

Transformer encoder stack
The Transformer encoder stack as defined by Vaswani et al. (2017) adds positional encoding vectors to the model inputs. In our case, we have two input sequences (polarized atoms and word vectors) of differing lengths, along with an alignment between them. We include the positional encoding vectors over the word positions as inputs to the encoder, and apply relative positional attention (Dai et al., 2019) during the self-attention step over the polarized atom positions. We found this combination most effective during development.
More precisely, we add the usual sinusoidal positional encoding vectors (Vaswani et al., 2017) P_w ∈ R^{words × vec} to the word vectors and map the result to the vertex indices via the alignment. We embed the polarized atoms via a trainable matrix E ∈ R^{pt × vec} and add them to their corresponding word vectors to yield inputs H^0 ∈ R^{vtx × vec} for the encoder stack: H^0 = U E + M (W + P_w), where U is the vertex one-hot matrix, M the word-vertex alignment, and W the word-vector matrix. The Transformer encoder stack consists of L encoder layers. Denoting the input to layer ℓ as H^{ℓ−1}, with H^0 as above, each layer computes H^ℓ by applying self-attention followed by a position-wise feed-forward network, each with a residual connection, with FFN defined as in Vaswani et al. (2017).⁴ From a given input sequence, standard self-attention computes query, key, and value tensors. Although all three tensors derive from the same input sequence, the key and value tensors function as "memory" tensors, so their sequence axis is a "lookup" axis distinct from that of the query tensors. In our case, the input sequence axis is vtx, so we preserve this distinction by renaming the vtx axis to vtx′ for the key and value tensors.
We employ relative positional encoding following Dai et al. (2019), allowing our model to directly learn to attend to polarized atoms at positions relative to a given atom. The relative positional vectors are represented as a tensor R ∈ R^{vtx × vtx′ × vec}, where R_{vtx(i), vtx′(j)} is the encoding of position j (on the key/value axis) relative to position i (on the query axis).
For multi-headed self-attention, each encoder layer ℓ computes |heads| attention heads, each of width |hdim|. We thus have trainable parameters W_{q,ℓ}, W_{k,ℓ}, W_{v,ℓ}, W_{r,ℓ} ∈ R^{heads × hdim × vec} and biases b_{k,ℓ}, b_{r,ℓ} ∈ R^{heads × hdim} with which we compute the query, key, value, relative position encoding, and attention score tensors Q, K, V, R̂, and S:

    Q = W_{q,ℓ} H^{ℓ−1}    (1)
    K = W_{k,ℓ} [H^{ℓ−1}]_{vtx→vtx′}    (2)
    V = W_{v,ℓ} [H^{ℓ−1}]_{vtx→vtx′}
    R̂ = W_{r,ℓ} R
    S = (Q + b_{k,ℓ}) · K + (Q + b_{r,ℓ}) · R̂    (3)

where vtx → vtx′ denotes renaming axis vtx to vtx′ and · contracts the hdim axis. Next, with final trainable parameter W_{o,ℓ} ∈ R^{heads × hdim × vec}, we compute the SelfAttn output:

    SelfAttn(H^{ℓ−1}) = W_{o,ℓ} · softmax_{vtx′}(S) V    (4)

Each encoder layer includes all of these steps except for the final layer ℓ = L, where we omit Equation 4. Finally, we apply a mask m ∈ {0, ∞}^{vtx × vtx′} to ensure that only edges from positive atoms to negative atoms of the same category are considered:

    Ŝ = S − m,  where m_{vtx(i), vtx′(j)} = 0 if atom(i) = atom(j), vertex i is positive, and vertex j is negative, and m_{vtx(i), vtx′(j)} = ∞ otherwise,    (5)

and atom(v) returns the atomic category of vertex v.

⁴We apply layer normalization (Ba et al., 2016) as well, but omit it here for concision. We use the "pre-norm" application order (Wang et al., 2019; Nguyen and Salazar, 2019).

Linkage loss function
Given the predicted score matrix S, it remains to specify how to predict candidate linkages. We first note that the problem with which we are presented at this stage is exactly that of finding a max-weight (or min-cost) perfect bipartite matching. Ideally, S will provide scores that, when optimized over, yield the desired matching, i.e., the ground-truth linkage.
As we aim to train our parser on a corpus with ground-truth linkages using gradient descent, our perfect matching algorithm must be differentiable so that gradients can be backpropagated through it from the loss function; we use Sinkhorn's algorithm (Sinkhorn and Knopp, 1967) with temperature, also known as SoftAssign (Kosowsky and Yuille, 1994; Gold and Rangarajan, 1996b). The procedure, which alternates normalizing the rows and columns of exp(S/τ), τ > 0, converges to a doubly-stochastic matrix:

    P = Sinkhorn(exp(S / τ))    (6)

In the limit of τ → 0, this converges to the optimal matching (Mena et al., 2018), thereby providing a means of computing the optimal linkage. Moreover, with the addition of standard Gumbel noise, S can be seen to parameterize a distribution over permutation matrices, with the Sinkhorn operator then functioning as a means of sampling from this distribution. Importantly for training with gradient descent, the operations of Sinkhorn's algorithm are fully differentiable. For a given doubly-stochastic output matrix P from Sinkhorn's algorithm, the negative log-likelihood ℒ_NLL of the ground-truth linkage L = ⟨p_1, n_1⟩, ⟨p_2, n_2⟩, ..., ⟨p_{|vtx|/2}, n_{|vtx|/2}⟩ is a natural choice of loss function for training:

    ℒ_NLL = −∑_{⟨p,n⟩∈L} log P_{vtx(p), vtx′(n)}    (7)

Our base model uses this loss function and is trained with the ground-truth linkages as targets.

Modelling term graph structure
The model just described is a straightforward application of attention scores and Sinkhorn's algorithm to the problem of finding linkages for a proof frame. However, the structured nature of the proof net is exploited only in the polarity and category restrictions on the matching; other relevant characteristics, such as the proof frame edges and the validity conditions, are not directly taken into account. Given that the validity conditions cannot even be evaluated without the proof frame edges, we hypothesize that including knowledge of the proof frame structure will help the model select valid linkages, or even the correct one. Similarly, encoding information about the validity conditions themselves may also be beneficial. We therefore incorporate term graph structure into our model in a number of ways, which we now describe.

Regular and Lambek edges
As is, the parser has no knowledge of the internal structure of the lexical categories: while it receives as input the atomic category and polarity of each vertex in the proof frame, it has no knowledge of the regular and Lambek edges. We hypothesize that incorporating this structure will boost parser performance, as the edges provide crucial information about which link combinations fail to satisfy the validity conditions for term graphs.
To encode these edges in the parser, we alter the inputs to the encoder's attention blocks. We represent the regular edges as an adjacency matrix A_R ∈ {0, 1}^{vtx × vtx′} where [A_R]_{vtx(u), vtx′(v)} = 1 if and only if the proof frame has a regular edge from (negative) vertex u to (positive) vertex v. Lambek edges are represented similarly as an adjacency matrix A_L. For each encoder layer ℓ, we introduce four new transformation matrices W_{q,R,ℓ}, W_{q,L,ℓ}, W_{k,R,ℓ}, W_{k,L,ℓ} ∈ R^{heads × hdim × vec} and alter Equations 1 and 2 by adding, for each adjacency matrix, a term that multiplies the adjacency matrix by a correspondingly transformed copy of the input. Per adjacency matrix, this alteration first computes a transformation of the input for both the query and key aspects of the self-attention transformation. Multiplying by the adjacency matrix then, for each vertex v, sets the value to be equal to the sum of the values of either v's out-neighbours or v's in-neighbours, depending on the particular term. This serves as a form of message passing along the graph edges, similar to some methods in graph-based neural networks (Gilmer et al., 2017).
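The adjacency-matrix multiplication in this alteration is a single message-passing step. A minimal numpy sketch of the query-side term for the regular edges (toy sizes; the weight names are illustrative and single-headed for clarity):

```python
import numpy as np

rng = np.random.default_rng(0)
n_vtx, d = 4, 8
H = rng.normal(size=(n_vtx, d))         # layer inputs, one row per vertex
W_q, W_qR = rng.normal(size=(2, d, d))  # ordinary and edge-specific query maps

# Regular-edge adjacency: A_R[u, v] = 1 iff there is a regular edge u -> v.
A_R = np.zeros((n_vtx, n_vtx))
A_R[0, 1] = A_R[2, 3] = 1.0

# Ordinary query term plus, per vertex, the sum of the transformed inputs of
# its out-neighbours: one message-passing step along the regular edges.
Q = H @ W_q + A_R @ (H @ W_qR)
```

Multiplying by the transpose of A_R instead would aggregate over in-neighbours, as the key-side term does.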

Planarity-aware attention
Condition T1 requires that term graph linkages be half-planar. We include planar crossing information in the attention scores of Equation 3: for each vertex pair, we subtract the mean attention score of conflicting vertex pairs. More formally, let X_{i,j} denote the set of vertex pairs between which a link would cross a link between vertex pair (i, j) in the half-plane above the linearly ordered vertices of the term graph. Then we adjust S as follows:

    S_{i,j} ← S_{i,j} − (1/|X_{i,j}|) ∑_{(u,v)∈X_{i,j}} S_{u,v}    (8)
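The crossing set and the mean-subtraction adjustment can be sketched directly (an illustrative dictionary-based version; the model applies the same adjustment as a tensor operation):

```python
import numpy as np

def crossing_pairs(i, j, candidates):
    """Candidate pairs whose link would cross a link between i and j when
    both are drawn as arcs above the linearly ordered vertices."""
    a, b = sorted((i, j))
    out = []
    for u, v in candidates:
        c, d = sorted((u, v))
        if a < c < b < d or c < a < d < b:  # spans interleave
            out.append((u, v))
    return out

def planarity_adjust(scores, candidates):
    """Subtract from each candidate's score the mean score of the candidates
    that cross it (unchanged if nothing crosses it)."""
    adjusted = {}
    for (i, j), s in scores.items():
        X = crossing_pairs(i, j, candidates)
        adjusted[(i, j)] = s - (np.mean([scores[p] for p in X]) if X else 0.0)
    return adjusted

scores = {(0, 2): 1.0, (1, 3): 2.0, (4, 5): 0.5}
adjusted = planarity_adjust(scores, list(scores))
```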

Edge filtering
In Equation 5, a mask is applied to the candidate link scores produced by the model to enforce the category and polarity constraints. We augment this mask in two ways. First, we disallow necessarily non-planar links, i.e., links that cannot occur in any planar linkage. Penn (2004) defined a context-free grammar for building planar linkages; we use this CFG with the inside-outside algorithm (Baker, 1979) to identify, for each candidate link, whether it occurs in any spanning planar linkage. Second, we disallow intra-word links, i.e., links between any two vertices that map to the same word. Some of these links are permissible according to the rules of L*, but they do not occur in our corpus; moreover, their linguistic utility is unclear. Overall, we expect that these restrictions on allowable links will help reduce the size of the search space, thereby improving system performance.
Figure 2 exemplifies how these extra filters can be useful. Disallowing necessarily non-planar links eliminates a candidate link from the NP⁺ of "for" to the NP⁻ of "What", as it would prevent the NP⁺ and S⁻ of "accounts" as well as the NP⁻ of "the" from having any planar links. This in turn implies that there must be a link from the NP⁺ of "for" to the NP⁻ of "the", since that is the only remaining option. Similarly, disallowing the S⁺ of "What" from linking to its own S⁻ immediately implies that it must link to the S⁻ of "accounts", while also preemptively preventing a violation of condition T2.
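The intra-word filter falls out directly from the word-vertex alignment matrix. A toy sketch (alignment values are illustrative; here the filter is rendered as an additive −∞ mask):

```python
import numpy as np

# Hypothetical alignment for 4 vertices over 2 words:
# M[i, w] = 1 iff vertex i belongs to word w.
M = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]], dtype=float)

# (M @ M.T)[i, j] > 0 exactly when vertices i and j share a word,
# which identifies the intra-word links to forbid.
intra_word = (M @ M.T) > 0
mask = np.where(intra_word, -np.inf, 0.0)  # added to candidate link scores
```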

k-best linkages
While Sinkhorn (with Gumbel noise) provides differentiable sampling of matchings, it has two noteworthy drawbacks. First, it sometimes does not converge to an exact permutation and gets stuck with some values very close to 0.5 (Guigues, 2020), requiring some means of discretizing such cases (e.g., Gold and Rangarajan, 1996a). This does not pose an issue for our parser during training, since ℒ_NLL does not require a permutation matrix. During inference, however, the parser needs to be able to produce a discrete result as its output parse. Second, Sinkhorn makes it difficult at best to retrieve multiple matchings from the distribution. Without Gumbel noise (and with sufficiently small τ), it will converge to the best permutation, but one cannot specifically retrieve the second-best (etc.) matchings from this. Sampling (via the addition of Gumbel noise) may yield multiple matchings, but there is no guarantee of their overall rank; furthermore, if the input matrix represents a very concentrated distribution, retrieving further matchings may require inordinately many sampling rounds.
Since there can be multiple valid parses for a sentence, a parser should ideally be able to return multiple parses when they exist. Moreover, there is no guarantee that the predicted linkage in Equation 6 will yield an L*-integral term graph, so it is worthwhile to be able to evaluate alternatives. We therefore use Murty's algorithm (Murty, 1968), a k-best optimal matching algorithm, to produce k candidate linkages from S. We stably sort the linkages according to the number of term graph conditions that they violate when combined with the input proof frame, allowing fractional violations of condition T3. Enabling the production of multiple candidate parses also makes it possible for the parser to return the correct parse when it otherwise might not have done so.
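What Murty's algorithm computes can be illustrated by brute force on a tiny score matrix. This toy sketch enumerates all n! matchings; Murty's algorithm produces the same best-first ranking without exhaustive enumeration:

```python
import numpy as np
from itertools import permutations

def k_best_matchings(S, k):
    """All perfect matchings of an n x n score matrix, ranked best-first
    by total score. Each matching is a permutation p assigning row i to
    column p[i]."""
    n = S.shape[0]
    scored = [(sum(S[i, p[i]] for i in range(n)), p)
              for p in permutations(range(n))]
    scored.sort(key=lambda x: -x[0])
    return scored[:k]

S = np.array([[3.0, 1.0],
              [1.0, 2.0]])
top2 = k_best_matchings(S, 2)
```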

L* structural loss
In contrast to other statistical parsers, the system presented thus far does not have any explicit encoding of the rules of the grammar. Since the negative log-likelihood loss function (Equation 7) is based only on the ground-truth linkage, it is not clear how well the model will be able to generalize and return multiple valid linkages when applicable, rather than linkages most similar to the correct one. We therefore introduce loss function terms that directly encode the term graph validity conditions, and posit that they will help the parser produce linkages that, with the input proof frame, yield an L*-integral term graph. These novel loss functions also enable training our model without ground-truth derivations.
For condition T1, we define the planarity loss function ℒ_T1 as a function of the post-Sinkhorn matrix P from Equation 6 so that each link in each pair of crossing links is penalized in proportion to the scores given to the pair:

    ℒ_T1 = ∑_{(i,j)} P_{i,j} ∑_{(u,v)∈X_{i,j}} P_{u,v}

where X_{i,j} is defined as in Equation 8. Minimizing ℒ_T1 then corresponds to minimizing the scores assigned to crossing links.
The remaining loss terms require further computation. Note that conditions T2 and T3 both express constraints on the (non-)existence of certain regular paths; the latter is already stated as such while the former can be equivalently restated as barring regular paths from any vertex to itself. Checking for the existence or absence of paths between two vertices of a graph requires traversing graph edges. As graph traversal corresponds to multiplication by the graph's adjacency matrix, this presents a differentiable means of computing the extent to which a candidate term graph meets conditions T2 and T3.
For an arbitrary weighted graph G with vertex set V, denote by W_{G,k,i,j} the set of all walks of length k from vertex i to vertex j. By definition, each walk w ∈ W_{G,k,i,j} is a sequence of edges, i.e., w = (⟨s_1, t_1⟩, ⟨s_2, t_2⟩, ..., ⟨s_k, t_k⟩) with s_1 = i and t_k = j. Let A_G denote the adjacency matrix of G; then the matrix power A_G^k represents the sum (over walks) of walk edge products:

    [A_G^k]_{i,j} = ∑_{w∈W_{G,k,i,j}} ∏_{⟨s,t⟩∈w} [A_G]_{s,t}

Returning to term graphs, for condition T2 this means that we can detect regular cycles in a candidate graph with |V| vertices by constructing an adjacency matrix A ∈ R^{|V| × |V|} from the regular edges of the proof frame together with the candidate linkage, computing ∑_{k=1}^{|V|} A^k, and then verifying that the diagonal entries are all zero. For condition T3, we can similarly inspect the entries corresponding to the parent and child nodes of all Lambek edges and verify that they are all one, indicating that a regular path exists. We can now specify these conditions as loss functions: ℒ_T2 sums the diagonal entries of ∑_{k=1}^{|V|} A^k, while ℒ_T3 penalizes, for each Lambek edge ⟨u, v⟩, the extent to which the corresponding entry of ∑_{k=1}^{|V|} A^k falls short of one. Minimizing ℒ_T2 and ℒ_T3 corresponds to minimizing violations of conditions T2 and T3, respectively.
We refer to the sum of these three loss functions ℒ_L* = ℒ_T1 + ℒ_T2 + ℒ_T3 as the (L*) structural loss.
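The matrix-power computation underlying the T2 and T3 terms can be sketched as follows (a toy numpy version on a hard 0/1 adjacency matrix; in the model, A contains the soft post-Sinkhorn link scores and the computation is differentiable):

```python
import numpy as np

def path_sums(A):
    """Sum of A^k for k = 1..|V|. Entry (i, j) accumulates edge-weight
    products over walks from i to j, so nonzero diagonal entries witness
    regular cycles (condition T2), and entry (u, v) witnesses a regular
    path for a Lambek edge <u, v> (condition T3)."""
    n = A.shape[0]
    total = np.zeros_like(A)
    power = np.eye(n)
    for _ in range(n):
        power = power @ A  # walks one step longer
        total += power
    return total

# Regular edges of a 3-vertex candidate graph (frame edges plus linkage).
A = np.zeros((3, 3))
A[0, 1] = A[1, 2] = 1.0

M = path_sums(A)
cycle_loss = np.trace(M)  # soft analogue of condition T2 (want 0)
```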

Related work
A key aspect of our parser is that it makes use of a structured decomposition of lexical categories in categorial grammars. In this sense, our work follows up on the intuition of recent "constructive" supertaggers, which have been explored for a type-logical grammar (Kogkalidis et al., 2019) and for CCG (Bhargava and Penn, 2020; Prange et al., 2021). Such supertaggers construct categories out of the atomic categories of the grammar; this challenges the classical approach to supertagging, where lexical categories are treated as opaque, rendering the task of supertagging equivalent to large-tagset POS tagging. With this view, it becomes possible for novel categories to be produced; furthermore, the supertaggers are better able to incorporate prediction history and thereby produce grammatical outputs (Bhargava and Penn, 2020).
Recently, Kogkalidis et al. (2020) proposed a system for parsing a "type-logical" grammar that is essentially a modal, non-directional extension of LCG. The Dutch grammar they used is substantially different from ours: their connectives are both modal and non-directional, and they have far more atomic categories. While their model is similar to our baseline (Section 3.1), our work differs substantially in that we incorporate proof-net structural elements and validity conditions, and our system is able to return multiple linkages (Sections 3.2-3.4). Our approach also enables ground-truth-free training.⁵ Lastly, our trained parser operates in polynomial time. Since LCG parsing is NP-complete, our work adds to the body of recent work applying neural networks to NP-hard combinatorial optimization problems to yield polynomial-time approximate solvers (e.g., Li et al., 2018; Gannouni et al., 2020; Sultana et al., 2020; Cappart et al., 2021).

Data
We train our models on LCGbank, a semi-automatic conversion of CCGbank to LCG (Fowler, 2016). This conversion necessitated adjusting for instances of CCG's crossing rules that are not permitted in LCG, as well as providing fully categorial parses for the cases in CCGbank where non-categorial rules are used (e.g., unary type-changing).⁶ LCGbank also omits features on its categories and includes the 274 sentences that were originally excluded from CCGbank. These adjustments substantially increase the number of lexical categories in LCGbank compared to CCGbank: without features, CCGbank has 476 unique lexical categories, while LCGbank has 987. Decomposed, the categories yield 10 atomic categories, shown in Table 1. We follow the CCGbank/PTB tradition of using section 0 for development/validation and section 23 for testing, yielding 1,921 and 2,414 sentences, respectively. For training, however, we use all of the remaining data (sections 1-22 and 24), in contrast to the usual training set for CCGbank (sections 2-21). This simply makes full use of all available data and yields 44,833 sentences for training.

⁵Although Kogkalidis et al. (2020) describe their model's training as "end-to-end", their approach is perhaps better described as joint training. A truly end-to-end system would allow differentiation through the supertagger/proof frame construction, which remains a topic for further investigation.
⁶Refer to Fowler (2016) for further conversion details.
For our inside-outside algorithm implementation, we adopt Rush's (2020) overall method for adapting it to GPU matrix operations. We implement the intensive parts of the algorithm as custom CUDA kernels that operate on packed Booleans. We use the fastmurty library (Motro and Ghosh, 2019) for the k-best matchings algorithm.
Training examples are sorted by output sequence length to yield efficient batches; the ordering of the batches is shuffled every epoch. We clip gradients, scaling accordingly, if the sum of gradient norms exceeds 1. We train our models with the AdamW optimizer (Loshchilov and Hutter, 2019;Kingma and Ba, 2014) for 40 epochs, halving the learning rate when performance reaches a plateau with patience of three epochs. We keep the model weights from the epoch with the best development set performance. We report results averaged over three training runs with different random seeds.

Experimental conditions and evaluation
We evaluate four conditions: (1) the baseline model (Section 3.1) trained only with ℒ_NLL; (2) the improved model (Section 3.2) trained only with ℒ_NLL; (3) the improved model trained with both ℒ_NLL and ℒ_L*; and (4) the improved model trained only with ℒ_L*. The last condition is trained without ground truth while the others are trained with it. Since comparing the two cases would be unfair (especially on a measure such as sentence accuracy), we evaluate them separately. To evaluate the effect of allowing k-best linkages (Section 3.3), we evaluate all conditions with both k = 1 and k = 512. Note that with k = 1, our baseline model is similar in design to that of Kogkalidis et al. (2020), with minor differences such as model sizes and vector embedding details; this represents the closest point of comparison while controlling for our other model aspects as well as our grammar and corpus.
We evaluate our parser using four measures: (1) link accuracy, the percentage of positive vertices that were assigned their correct negative vertex; (2) sentence accuracy, the percentage of sentences with 100% link accuracy; (3) coverage, the percentage of sentences for which an L*-integral linkage was found; and (4) the average number of unique parses (i.e., L*-integral) found per sentence.
When computing ℒ_NLL, we use Sinkhorn temperature τ = 0.01. We use a separate temperature parameter τ_L* when computing ℒ_L*. The intuition behind this is that because ℒ_NLL permits one specific parse while ℒ_L* permits multiple parses, they may require different levels of uncertainty in their corresponding doubly-stochastic matrices. We treat τ_L* as a hyperparameter with initial values sampled log-uniformly on [0.01, 10). For the condition that includes both ℒ_NLL and ℒ_L*, we linearly combine the two to obtain the final objective function ℒ = λℒ_NLL + (1 − λ)ℒ_L*. We tune λ as a hyperparameter as well, with initial uniform sampling on [0.05, 0.95).

Table 2 shows the performance of the systems trained against gold-standard linkages. Evaluating multiple linkages from the single score matrix is clearly beneficial on all accounts. In particular, doing so yields almost complete coverage in all cases, but especially for our two improved versions. The accuracies improve as well; since our sorting of multiple candidate linkages is stable, the improvements to sentence accuracy come from cases where the correct parse was scored higher than other valid parses, but lower than some invalid parses. Here, filtering the list using the term graph validity conditions is clearly useful.

Training with ground truth
Incorporating term graph structure in the model as described in Section 3.2 improves performance as well, though not by as much as evaluating multiple linkages. While we expected the number of parses per sentence found by the parser to increase due to the presence of grammatico-structural information, it in fact returned fewer parses. With ℒ_NLL as the sole training objective, the model uses this extra information solely to increase its performance as measured by that objective. Interestingly, adding ℒ_L* on top of the model improvements seems to decrease accuracy, nearly to the baseline's level in the k = 512 case. Coverage remains high, however. In this case, we believe that the two training objectives are somewhat conflicting, with ℒ_NLL pushing the model towards the correct linkage but ℒ_L* equally preferring other valid linkages.

Training without ground truth
Training a model without ground-truth linkages impairs system performance substantially, as expected: the model has no signal guiding it to the correct linkage, nor one differentiating the correct linkage from other valid ones. With k = 1, the system achieves 91.2% coverage on the LCGbank test set.
With k = 512, this increases substantially to 96.2%. Here the parser finds an average of 5.9 parses per sentence. Since it did not find a single valid parse for 3.8% of sentences, the average number of parses found for covered sentences is 6.2. This is further in line with the idea that ℒ_L* "pulls" the model away from the correct parse in the direction of other (valid) parses.
Since the loss function cannot distinguish correct linkages from other valid ones, this configuration cannot be expected to select the correct linkage. Nonetheless, the correct parse appears in the system's set of output parses for 79.0% of sentences, appearing at the top (i.e., the correct parse is given the highest score) for 53.4% of sentences with k = 1 and 54.9% with k = 512.
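These last two figures can be computed from the ranked output lists with a simple check (a sketch under the same assumed linkage representation as above; names are our own):

```python
def oracle_and_top_rates(kbest_lists, golds):
    # kbest_lists[i]: the valid linkages returned for sentence i, sorted
    # best-scored first; golds[i]: the correct linkage for sentence i.
    n = len(golds)
    # fraction of sentences whose output set contains the correct parse
    in_set = sum(g in ks for ks, g in zip(kbest_lists, golds)) / n
    # fraction of sentences where the correct parse is ranked first
    at_top = sum(bool(ks) and ks[0] == g
                 for ks, g in zip(kbest_lists, golds)) / n
    return in_set, at_top
```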

Analysis
Finally, we conduct a post-hoc ablation study for the ground-truth-free condition. For each ablated aspect, we adjust the model or loss function accordingly, and then retrain the model from scratch using the same hyperparameters as the original model. Table 3 shows the results, comparing the coverage of the ablated versions with that of the original.
We see that removing all planarity information (i.e., the link filtering, the planarity-aware attention, and the planarity loss term ℒ_T1) is disastrous; this condition has by far the largest drop in coverage. This is especially notable as LCG proof nets must be half-planar due to the non-commutativity of L*; this useful constraint is not present in type-logical grammars that lack this property, such as the one employed by Kogkalidis et al. (2020).
Table 3: LCGbank test set coverage under various ground-truth-free training conditions. −ℒ_i removes loss term ℒ_i; −RL removes regular and Lambek edges; −IW removes the filter on intra-word links; −NP removes the filter on non-planar links; −PA removes planarity-aware attention. In contrast to Table 2, here the ablated versions (all but the first line) are results from one single training run each.

Other decreases range from moderate to small:
• All three loss terms are important, with coverage decreasing notably upon ablation; the decrease is smallest for ℒ_T1, suggesting that its removal is partially ameliorated by the other sources of planarity information in the model.
• Removing the regular and Lambek edge information decreases coverage by a small amount.
• Filtering out intra-word links is surprisingly important; we had suspected that, since the model has information about whether the atomic categories in a given pair belong to the same word, it would learn to avoid such links on its own. If the filter on non-planar links is also removed, coverage drops further. Removing planarity-aware attention and the proof frame edge information (i.e., stripping down to the baseline system of Section 3.1, but here training with the structural loss only) strangely restores some coverage.
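For concreteness, the half-planarity condition that the non-planar-link filter enforces can be checked directly: two links cross exactly when their endpoints interleave in the linear order of the proof frame's atoms. A minimal sketch, with the pair-list representation being our own assumption:

```python
def is_half_planar(links):
    # links: (i, j) index pairs over the linearly ordered atomic
    # categories; the linkage is half-planar iff no two arcs drawn on one
    # side of the sequence cross, i.e., no a < c < b < d pattern occurs.
    arcs = sorted(tuple(sorted(link)) for link in links)
    for x, (a, b) in enumerate(arcs):
        for (c, d) in arcs[x + 1:]:
            if a < c < b < d:  # arcs (a, b) and (c, d) cross
                return False
    return True
```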

Conclusion and future work
We have presented an LCG parser with multiple novel techniques, including neural encodings of term graph structure and structural constraints, novel loss functions derived from LCG term graph validity conditions, and a self-attention-based system for returning and efficiently evaluating k-best matchings. Evaluating on a corpus of English LCG proof nets, we found our improvements to be effective, especially the k-best matchings. Our loss functions furthermore enable training an LCG parsing model without ground-truth derivations or linkages. Analysis shows that the planarity conditions are especially important, but that all of our alterations contribute to the parser's improved performance. As we saw in Table 2, combining ℒ_NLL and ℒ_L* seems to be detrimental to parser accuracy. The two loss terms have seemingly conflicting objectives, with the former concentrating probability mass around a single solution and the latter spreading probability mass over multiple solutions. We believe it would be worthwhile to explore combining the two in a more congruent manner.
Since our model allows differentiating through the structure of lexical categories, the obvious next step is to incorporate a supertagger and pass gradients down to it. As it stands, supertaggers have rudimentary knowledge of their context, with no notion of how the atomic subcategories of one category might combine with those of another. A tight coupling of the techniques we proposed here with an appropriately designed supertagger would yield a true end-to-end differentiable LCG parser.
Lastly, we believe further investigation of structural constraints and objectives to be promising. Although we still relied on supertags from the corpus, our results with the grammatico-structural loss functions demonstrate the training of a high-coverage parser with a decreased annotation burden. Techniques such as those presented here suggest a potential path to parsing with lower data requirements, or perhaps even to structured, formalism-driven unsupervised parsing.