Span-based Semantic Parsing for Compositional Generalization

Despite the success of sequence-to-sequence (seq2seq) models in semantic parsing, recent work has shown that they fail in compositional generalization, i.e., the ability to generalize to new structures built of components observed during training. In this work, we posit that a span-based parser should lead to better compositional generalization. We propose SpanBasedSP, a parser that predicts a span tree over an input utterance, explicitly encoding how partial programs compose over spans in the input. SpanBasedSP extends Pasupat et al. (2019) to be comparable to seq2seq models by (i) training from programs, without access to gold trees, treating trees as latent variables, and (ii) parsing a class of non-projective trees through an extension to standard CKY. On GeoQuery, SCAN and CLOSURE datasets, SpanBasedSP performs similarly to strong seq2seq baselines on random splits, but dramatically improves performance compared to baselines on splits that require compositional generalization: from 61.0 → 88.9 average accuracy.


Introduction
The most dominant approach in recent years for semantic parsing, the task of mapping a natural language utterance to an executable program, has been based on sequence-to-sequence (seq2seq) models (Jia and Liang, 2016;Dong and Lapata, 2016;Wang et al., 2020, inter alia). In these models, the output program is decoded step-by-step (autoregressively), using an attention mechanism that softly ties output tokens to the utterance.
Despite the success of seq2seq models, Finegan-Dollak et al. (2018), Keysers et al. (2020), and Herzig and Berant (2019) recently demonstrated that such models fail at compositional generalization, that is, they do not generalize to program structures that were not seen at training time. For example, a model that observes at training time the questions "What states border China?" and "What is the largest state?" fails to generalize to questions such as "What states border the largest state?". This is manifested in large performance drops on data splits designed to measure compositional generalization (compositional splits), and is in contrast to the generalization abilities of humans (Fodor and Pylyshyn, 1988).
In this work, we posit that the poor generalization of seq2seq models is due to the fact that the input utterance and output program are only tied softly through attention. We revisit a more traditional approach for semantic parsing (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Liang et al., 2011), where partial programs are predicted over short spans in the utterance, and are composed to build the program for the entire utterance. Such explicit inductive bias for compositionality should encourage compositional generalization.
Specifically, we propose to introduce such inductive bias via a span-based parser (Stern et al., 2017; Pasupat et al., 2019), equipped with the advantages of modern neural architectures. Our model, SPANBASEDSP, predicts for every span in the input a category, which is either a constant from the underlying knowledge-base, a composition category, or a null category. Given the category predictions for all spans, we can construct a tree over the input utterance and deterministically compute the output program. For example, in Figure 1, the category for the tree node covering the span "New York borders ?" is the composition category join, indicating the composition of the predicate next_to_1 with the entity stateid('new york').
Categories are predicted for each span independently, resulting in a very simple training procedure. CKY is used at inference time to find the best span tree, which is a tree with a category predicted at every node. The output program is computed from this tree in a bottom-up manner.
Figure 1: An example span tree. Nodes are annotated with categories (in bold). A node with a category join over the span (i, j), is annotated with its sub-program z i:j . We abbreviate stateid('new york') to NY. Node annotations recovered from the figure: join: capital(loc_2(state(next_to_1(NY)))); join: loc_2(state(next_to_1(NY))); join: state(next_to_1(NY)); join: next_to_1(NY); join: next_to_1.
We enhance the applicability of span-based semantic parsers (Pasupat et al., 2019) in terms of both supervision and expressivity, by overcoming two technical challenges. First, we do not use gold trees as supervision, only programs with no explicit decomposition over the input utterance. To train with latent trees, we use a hard-EM approach, where we search for the best tree under the current model corresponding to the gold program, and update the model based on this tree. Second, some gold trees are non-projective, and cannot be parsed with a binary grammar. Thus, we extend the grammar of CKY to capture a class of non-projective structures that are common in semantic parsing. This leads to a model that is comparable and competitive with the prevailing seq2seq approach.
We evaluate our approach on three datasets, and find that SPANBASEDSP performs similarly to strong seq2seq baselines on standard i.i.d (random) splits, but dramatically improves performance on compositional splits, by 32.9, 34.6 and 13.5 absolute accuracy points on GEOQUERY (Zelle and Mooney, 1996), CLOSURE (Bahdanau et al., 2019), and SCAN (Lake and Baroni, 2018), respectively. Our code and data are available at https://github.com/jonathanherzig/span-based-sp.

Problem Setup
We define span-based semantic parsing as follows. Given a training set of pairs (x i , z i ), where x i is an utterance and z i is the corresponding program, our goal is to learn a model that maps a new utterance x to a span tree T (defined below), such that program(T ) = z. The deterministic function program(·) maps span trees to programs.
Span trees A span tree T is a tree (see Figure 1) where, similar to constituency trees, each node covers a span (i, j) with tokens x i:j = (x i , x i+1 , . . . , x j ). A span tree can be viewed as a mapping from every span (i, j) to a single category c ∈ C, where categories describe how the meaning of a node is derived from the meaning of its children. A category c is one of the following:
• Σ: a set of domain-specific categories representing domain constants, including entities and predicates. E.g., in Figure 1, capital, state, loc_2 and next_to_1 are binary predicates, and stateid('new york') is an entity.
• join: a category for a node whose meaning is derived from the meaning of its two children. At most one of the children's categories can be the φ category.
• φ: a category for (i) a node that does not affect the meaning of the utterance. For example, in Figure 1, the nodes that cover "What is the" and "?" are tagged by φ; (ii) spans that do not correspond to constituents (tree nodes).
Overall, the category set is C = Σ ∪ {φ, join}. We also define the terminal nodes set Σ + = Σ ∪ {φ}, corresponding to categories that are directly over the utterance.
Computing programs for span trees Given a mapping from spans to categories specifying a span tree T , we use the function program(·) to find the program for T . Concretely, program(T ) iterates over the nodes in T bottom-up, and generates a program z i:j for each node covering the span (i, j).
The program z i:j is computed deterministically. For a node with category c ∈ Σ, z i:j = c. For a join node over the span (i, j), we determine z i:j by composing the programs of its children, z i:s and z s:j , where s is the split point. As in Combinatory Categorial Grammar (Steedman, 2000), composition is simply function application, where a domain-specific type system is used to determine which child is the function and which is the argument (along with the exact argument position for predicates with multiple arguments). If the category of one of the children is φ, the program for z i:j is copied from the other child. E.g., in Figure 1, the span (8, 9), where z 8:9 = stateid('new york'), combines with the span (10, 11), where z 10:11 = next_to_1. As z 10:11 is a binary predicate that takes an argument of type state, and z 8:9 is an entity of type state, the output program is z 8:11 = next_to_1(stateid('new york')). If no combination is possible according to the type system, the execution of program(T ) fails ( §3.2).
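The bottom-up computation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` and `Prog` classes, the `PREDICATES` and `ENTITIES` tables, and the type names `place` and `city` are hypothetical, and the toy type system only covers single-argument function application.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical toy type system: predicate -> (argument type, result type).
PREDICATES = {
    "next_to_1": ("state", "state"),
    "state": ("state", "state"),
    "loc_2": ("state", "place"),
    "capital": ("place", "city"),
}
# Hypothetical entity table: entity constant -> type.
ENTITIES = {"stateid('new york')": "state"}


@dataclass
class Prog:
    text: str
    arg_type: Optional[str]  # None once the program is a complete value
    out_type: str


@dataclass
class Node:
    category: str  # a constant from Sigma, "join", or "phi"
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf_program(category):
    """Program of a node whose category is a domain constant."""
    if category in ENTITIES:
        return Prog(category, None, ENTITIES[category])
    arg_type, out_type = PREDICATES[category]
    return Prog(category, arg_type, out_type)


def compose(a, b):
    """Function application; the types decide which child is the function."""
    for fn, arg in ((a, b), (b, a)):
        if fn.arg_type is not None and arg.arg_type is None and fn.arg_type == arg.out_type:
            return Prog(f"{fn.text}({arg.text})", None, fn.out_type)
    raise TypeError(f"cannot compose {a.text} with {b.text}")


def program(node):
    """Compute the sub-program of a node bottom-up (None for phi nodes)."""
    if node.category == "phi":
        return None
    if node.category != "join":
        return leaf_program(node.category)
    left, right = program(node.left), program(node.right)
    if left is None:  # a phi child: copy the program of the other child
        return right
    if right is None:
        return left
    return compose(left, right)


ny_borders = Node("join", Node("stateid('new york')"), Node("next_to_1"))
tree = Node("join", Node("phi"),
            Node("join", Node("capital"),
                 Node("join", Node("loc_2"),
                      Node("join", Node("state"), ny_borders))))
result = program(tree)
```

In this sketch a type mismatch raises `TypeError`, which plays the role of a failed execution of program(T ).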
Unlike seq2seq models, computing programs with span trees is explicitly compositional. Our main hypothesis is that this strong inductive bias should improve compositional generalization.

A Span-based Semantic Parser
Span-based parsing has had success in both syntactic (Stern et al., 2017; Kitaev and Klein, 2018) and semantic parsing (Pasupat et al., 2019). The intuition is that modern sequence encoders are powerful, and thus we can predict a category for every span independently, reducing the role of global structure. This leads to simple and fast training.
Specifically, our parser is based on a model p θ (T [i, j] = c), parameterized by θ, that provides for every span (i, j) a distribution over categories c ∈ C. Due to the above independence assumption, the log-likelihood of a tree T is defined as:
$$\log p(T) = \sum_{i<j} \log p_\theta(T[i,j]) \qquad (1)$$
where, similar to Pasupat et al. (2019), the sum is over all spans i < j and not only over constituents. We next describe the model p θ (T [i, j]) and its training, assuming we have access to gold span trees at training time ( §3.1). We will later ( §3.3) remove this assumption, and describe a CKY-based inference procedure ( §3.2) that finds for every training example (x, z) the (approximately) most probable span tree T * train , such that program(T * train ) = z. We use T * train as a replacement for the gold tree. Last, we present an extension of our model that covers a class of span trees that are non-projective ( §3.4).

Model
We describe the architecture and training procedure of our model (SPANBASEDSP), assuming we are given for every utterance x a gold tree T , for which program(T) = z.
Similar to Pasupat et al. (2019), we minimize the negative log-likelihood − log p(T ) (Eq. 1) for the gold tree T . The loss decomposes over spans into cross-entropy terms for every span (i, j). This effectively results in a multi-class classification problem, where for every span x i:j we predict a category c ∈ C. Training in this setup is trivial and does not require any structured inference.
Concretely, the architecture of SPANBASEDSP is based on a BERT-base encoder (Devlin et al., 2019) that yields a contextual representation h i ∈ R^{h_dim} for each token x i in the input utterance. We represent each span (i, j) by concatenating its start and end representations [h i ; h j ], and apply a 1-hidden-layer network to produce a real-valued score s(x i:j , c) for a span (i, j) and category c, taken from entry ind(c) of the network output, where ind(c) is the index of the category c. We take a softmax to produce the probabilities:
$$p_\theta(T[i,j] = c) = \frac{\exp[s(x_{i:j}, c)]}{\sum_{c'} \exp[s(x_{i:j}, c')]}$$
and train the model with a cross-entropy loss averaged over all spans, as mentioned above.
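The span classifier can be illustrated with a small pure-Python sketch. Everything concrete here is an assumption made for illustration: random vectors stand in for the BERT representations, the hidden layer uses a ReLU nonlinearity with no bias terms, the dimensions are arbitrary, and spans are taken as half-open token ranges (i, j) with 0 ≤ i < j ≤ n, so that the end representation is that of token j − 1.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def span_distributions(h, W1, W2):
    """Return {(i, j): distribution over categories} for every span."""
    dists = {}
    n = len(h)
    for i in range(n):
        for j in range(i + 1, n + 1):
            rep = h[i] + h[j - 1]  # concatenate start and end representations
            hidden = [max(0.0, a) for a in matvec(W1, rep)]  # ReLU is an assumption
            dists[i, j] = softmax(matvec(W2, hidden))  # one score per category
    return dists


rng = random.Random(0)
h_dim, hidden_dim, n_categories, n_tokens = 4, 3, 5, 6
h = [[rng.gauss(0, 1) for _ in range(h_dim)] for _ in range(n_tokens)]
W1 = [[rng.gauss(0, 1) for _ in range(2 * h_dim)] for _ in range(hidden_dim)]
W2 = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(n_categories)]
dists = span_distributions(h, W1, W2)
```

Because every span gets its own independent distribution, training reduces to a per-span cross-entropy loss against the category that the gold tree assigns to that span.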

CKY-based Inference
While we assume span-independence at training time, at test time we must output a valid span tree. We now describe an approximate K-best CKY algorithm that searches for the K most probable trees under p(T ), and returns the highest-scoring one that is semantically valid, i.e., that can be mapped to a program. As we elaborate below, some trees cannot be mapped to a program, due to violations of the type system. We start by re-writing our objective function, as proposed in Pasupat et al. (2019). Given our definition for p θ (T [i, j] = c), the log-likelihood is:
$$\log p(T) = \sum_{i<j} \log p_\theta(T[i,j]) = \sum_{i<j} \Big[ s(x_{i:j}, T[i,j]) - \log \sum_{c'} \exp[s(x_{i:j}, c')] \Big]$$
We shift the scoring function s(·) for each span, such that the score for the φ category is zero:
$$s'(x_{i:j}, \cdot) := s(x_{i:j}, \cdot) - s(x_{i:j}, \phi)$$
Because softmax is shift-invariant, we can replace s(·) with s'(·) and preserve correctness. This is motivated by the fact that φ nodes, such as the one covering "What is the" in Figure 1, do not affect the semantics of the utterance. By shifting scores such that for all spans s'(x i:j , φ) = 0, their score does not affect the overall tree score. Spans that do not correspond to tree nodes are labeled by φ and also do not affect the tree score.
Figure 2: The grammar rules: S := join join | φ join; join := join join | join φ.
Furthermore, as $\sum_{i<j} \log \sum_{c'} \exp[s'(x_{i:j}, c')]$ does not depend on T at all, maximizing log p(T ) is equivalent to maximizing the tree score:
$$S(T) := \sum_{i<j} s'(x_{i:j}, T[i,j])$$
where spans labeled φ contribute zero. This scoring function can be maximized using CKY (Cocke, 1969; Kasami, 1965; Younger, 1967). We now propose a grammar, which imposes further restrictions on the space of possible output trees at inference time.
We use a small grammar G = (N, Σ + , R, S), where N = {S, join} is the set of non-terminals, Σ + is the set of terminals (defined in §2), R is a set of four rules detailed in Figure 2, and S is a special start symbol. The four grammar rules impose the following constraints on the set of possible output trees: (a) a join or S node can have at most one φ child, as explained in §2; (b) nodes with no semantics combine with semantic elements on their left; (c) except at the root where they combine with elements on their right. Imposing such a consistent tree structure is useful for training SPANBASEDSP when predicted trees are used for training ( §3.3).
Algorithm 1: CKY inference algorithm. Input: ∀i, j, c : s(x i:j , c), G = (N, Σ + , R, S), x. Output: π, the scores for each span and non-terminal.
The grammar G can generate trees that are not semantically valid. For example, we could generate the program capital(placeid('mount mckinley')), which is semantically vacuous. We use a domain-specific type system and assign the score S(T ) = −∞ to every tree that yields a semantically invalid program. This global factor prevents exact inference, and thus we perform K-best parsing, keeping the top-K (K = 5) trees for every span (i, j) and non-terminal.
Alg. 1 summarizes CKY inference, which outputs π(i, j, X), the maximal score for a tree with non-terminal root X over the span (i, j). In Lines 1-3 we initialize the parse chart, by going over all spans and setting π(i, j, join) to the top-K highest scoring domain constants (Σ), and fixing the score for φ to be zero. We then perform the typical CKY recursion to find the top-K trees that can be constructed through composition (Line 6), merge them with the domain constants found during initialization (Line 7), and keep the overall top-K trees.
Once inference is done, we retrieve the top-K trees from π(1, |x|, S), iterate over them in descending score order, and return the first tree T * that is semantically valid.
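The inference procedure above can be sketched as follows. This is a simplified illustration rather than the authors' Algorithm 1: spans are half-open token ranges (i, j), the φ category is written "phi", the shifted scores are supplied through a hypothetical `score(i, j, c)` callback, and semantic validity is delegated to a hypothetical `is_valid` callback standing in for the type system. Truncating each chart cell to K entries is what makes the search approximate.

```python
import heapq
from itertools import product

# The four binary rules of the grammar: head -> (left child, right child).
RULES = {
    "join": [("join", "join"), ("join", "phi")],
    "S": [("join", "join"), ("phi", "join")],
}


def kbest_cky(n, score, constants, K=5, is_valid=lambda tree: True):
    """Return (score, tree) of the best valid tree over n tokens, or None."""
    pi = {}
    # Initialization: top-K domain constants per span; phi always scores zero.
    for i in range(n):
        for j in range(i + 1, n + 1):
            leaves = [(score(i, j, c), (c, i, j)) for c in constants]
            pi[i, j, "join"] = heapq.nlargest(K, leaves, key=lambda t: t[0])
            pi[i, j, "phi"] = [(0.0, ("phi", i, j))]
            pi[i, j, "S"] = []
    # Recursion over spans of increasing length.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for head, rhs in RULES.items():
                candidates = list(pi[i, j, head])  # merge with the constants
                for s in range(i + 1, j):
                    for left, right in rhs:
                        pairs = product(pi[i, s, left], pi[s, j, right])
                        for (ls, lt), (rs, rt) in pairs:
                            total = score(i, j, "join") + ls + rs
                            candidates.append((total, ("join", i, j, lt, rt)))
                pi[i, j, head] = heapq.nlargest(K, candidates, key=lambda t: t[0])
    # Return the first semantically valid tree in descending score order.
    for total, tree in pi[0, n, "S"]:
        if is_valid(tree):
            return total, tree
    return None


# Toy scores for the three tokens "what borders utah".
TABLE = {(1, 2, "next_to_1"): 2.0, (2, 3, "utah"): 2.0,
         (1, 3, "join"): 1.0, (0, 3, "join"): 1.0}


def toy_score(i, j, c):
    return TABLE.get((i, j, c), -5.0)


result = kbest_cky(3, toy_score, ["next_to_1", "utah"])
```

In the toy example the root combines a φ node over "what" with a join node over "borders utah", mirroring rule S := φ join.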

Training without Gold Trees
We now remove the assumption of access to gold trees at training time, in line with standard supervised semantic parsing, where only the gold program z is given, without its decomposition over x. This can be viewed as a weakly-supervised setting, where the correct span tree is a discrete latent variable. In this setup, our goal is to maximize
$$\log p(z \mid x) = \log \sum_{T : \mathrm{program}(T) = z} p(T)$$
Because marginalizing over trees is intractable, we take a hard-EM approach (Min et al., 2019), and replace the sum over trees with an argmax. More concretely, to approximately solve the argmax and find the highest scoring tree, T * train , we employ a constrained version of Alg. 1, that prunes out trees that cannot generate z.
We first remove all predictions of constants that do not appear in z by setting their score to −∞:
$$s(x_{i:j}, c) := -\infty \quad \forall c \in \Sigma \setminus \mathrm{const}(z)$$
where const(z) is the set of domain constants appearing in z. Second, we allow a composition of two nodes covering spans (i, s) and (s, j) only if their sub-programs z i:s and z s:j can compose according to z. For instance, in Figure 1, a span with the sub-program capital can only compose with a span with the sub-program loc_2(·). After running this constrained CKY procedure we return the highest scoring tree that yields the correct program, T * train , if one is found. We then treat the span structure of T * train as labels for training the parameters of SPANBASEDSP.
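The two pruning steps can be sketched as below. The helper names `const`, `constrain_scores` and `can_compose` are hypothetical, the regular expression assumes programs written in the functional notation used in the examples of this paper, and the string-containment test in `can_compose` is only a crude stand-in for checking composition against the gold program, which the text does not spell out.

```python
import math
import re

# A constant is either an entity such as stateid('new york') or a bare name.
CONSTANT = re.compile(r"[A-Za-z_]\w*\('[^']*'\)|[A-Za-z_]\w*")


def const(z):
    """Set of domain constants appearing in the program string z."""
    return set(CONSTANT.findall(z))


def constrain_scores(scores, z, structural=("join", "phi")):
    """Set the score of every constant that does not appear in z to -inf."""
    allowed = const(z)
    return {
        (i, j, c): (s if c in structural or c in allowed else -math.inf)
        for (i, j, c), s in scores.items()
    }


def can_compose(sub_program, z):
    """Crude check (assumption): the composed sub-program must occur in z."""
    return sub_program in z


z = "capital(loc_2(state(next_to_1(stateid('new york')))))"
scores = {(0, 1, "capital"): 1.2, (0, 1, "river"): 3.4, (0, 2, "join"): 0.7}
pruned = constrain_scores(scores, z)
```

With the constants masked this way, the same K-best search used at test time can be run unchanged, and only trees whose program equals z are kept.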
Past work on weakly-supervised semantic parsing often used maximum marginal likelihood, especially when training from denotations only (Guu et al., 2017). In this work, we found hard-EM to be simple and sufficient, since we are given the program z that provides a rich signal for guiding search in the space of latent trees.

Exact match features
The challenge of weakly-supervised parsing is that SPANBASEDSP must learn both how to map language phrases to constants and how the span tree is structured. To alleviate the language-to-constant problem we add an exact match feature, based on a small lexicon, indicating whether a phrase in x matches the language description of a category c ∈ Σ. These features are considered in SPANBASEDSP when some phrase matches a category from Σ, updating the score s(x i:j , c) to be:
$$s(x_{i:j}, c) \leftarrow s(x_{i:j}, c) + \lambda \cdot \delta(x_{i:j}, c)$$
where δ(x i:j , c) is an indicator that returns 1 if c ∈ lexicon[x i:j ], and 0 otherwise, and λ is a hyper-parameter that sets the feature's importance.
We use two types of lexicon[·] functions. In the first, the lexicon is created automatically to map the names of entities (not predicates), as they appear in Σ, to their corresponding constant (e.g., lexicon["new york"] = stateid('new york')). This endows SPANBASEDSP with a copying mechanism, similar to seq2seq models, for predicting entities unseen during training. In the second lexicon we manually add no more than two examples of language phrases for each constant in Σ. E.g., for the predicate next_to_1, we update the lexicon to include lexicon["border"] = lexicon["borders"] = next_to_1. This requires minimal manual work (if no language phrases are available), but is done only once, and is common in semantic parsing (Zettlemoyer and Collins, 2005; Wang et al., 2015).
Figure 3: Root node annotation recovered from the figure: join: largest_one(pop_1(state(all))).
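The exact match feature can be illustrated with a short sketch. It assumes the feature enters the score additively, weighted by λ, and that phrases are matched case-insensitively; the function names and the dictionary layout of the lexicon are hypothetical. The lexicon entries are the ones given as examples in the text.

```python
# Lexicon: phrase -> set of constants the phrase may denote.
lexicon = {
    "new york": {"stateid('new york')"},  # automatic entity entry
    "border": {"next_to_1"},              # manual predicate entries
    "borders": {"next_to_1"},
}


def delta(phrase, c, lexicon):
    """Indicator: 1 if constant c is listed for this phrase, else 0."""
    return 1.0 if c in lexicon.get(phrase.lower(), set()) else 0.0


def score_with_exact_match(base_score, phrase, c, lexicon, lam):
    """Add lam to the span score when the phrase exactly matches constant c."""
    return base_score + lam * delta(phrase, c, lexicon)


boosted = score_with_exact_match(0.5, "borders", "next_to_1", lexicon, lam=2.0)
unchanged = score_with_exact_match(0.5, "borders", "capital", lexicon, lam=2.0)
```

Because entity entries are keyed by surface name, an entity that never occurred in training still receives the boost whenever its name appears in the utterance, which is the copying behaviour described above.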

Non-Projective Trees
Our span-based parser assumes composition can only be done for adjacent spans that form together a contiguous span. However, this assumption does not always hold (Liang et al., 2011). For example, in Figure 3, while the predicate pop_1 should combine with the predicate state, the spans they align to ("people" and "state" respectively) are not contiguous, as they are separated by "most", which contributes the semantics of a superlative.
In constituency parsing, such non-projective structures are treated by adding rules to the grammar G (Maier et al., 2012; Corro, 2020; Stanojević and Steedman, 2020). We identify one specific class of non-projective structures that is frequent in semantic parsing (Figure 3), and expand the grammar G and the CKY algorithm to support this structure. Specifically, we add the ternary grammar rule join := join join join. During CKY, when calculating the top-K trees for spans (i, j) (line 6 in Alg. 1), we also consider the following top-K scores for the non-terminal join: [s'(x i:j , join) + π(i, s 1 , join) + π(s 1 + 1, s 2 , join) + π(s 2 + 1, j, join)].
These additional trees are created by going over all possible ways of dividing a span (i, j) into three parts. The score of the sub-tree is then the sum of the score of the root added to the scores of the three children. To compute the program for such ternary nodes, we again use our type system, where we first compose the programs of the two outer spans (i, s 1 ) and (s 2 + 1, j) and then compose the resulting program with the program corresponding to the span (s 1 + 1, s 2 ). Supporting ternary nodes in the tree increases the time complexity of CKY from O(n 3 ) to O(n 4 ) for our implementation. 2
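The enumeration of three-way splits can be sketched as follows. This is a 1-best simplification of the K-best computation described above, using the inclusive span indexing of the formula; the function name and the dictionary `pi`, which maps an inclusive span to its best join score, are hypothetical.

```python
import math


def best_ternary_score(i, j, pi, join_score):
    """Best score of a ternary join node over the inclusive span (i, j).

    pi maps an inclusive span (a, b) to its best join score; join_score is
    the shifted score s'(x_{i:j}, join) of the root of the sub-tree.
    """
    best_score, best_split = -math.inf, None
    for s1 in range(i, j - 1):           # left part is (i, s1)
        for s2 in range(s1 + 1, j):      # middle is (s1 + 1, s2), right is (s2 + 1, j)
            total = (join_score
                     + pi.get((i, s1), -math.inf)
                     + pi.get((s1 + 1, s2), -math.inf)
                     + pi.get((s2 + 1, j), -math.inf))
            if total > best_score:
                best_score, best_split = total, (s1, s2)
    return best_score, best_split


pi = {(0, 0): 1.0, (1, 1): 2.0, (2, 3): 0.5, (1, 2): 0.1,
      (3, 3): 3.0, (0, 1): -1.0, (2, 2): 0.2}
score, split = best_ternary_score(0, 3, pi, join_score=0.5)
```

Each span has a quadratic number of split pairs (s 1 , s 2 ), and there is a quadratic number of spans, which is where the O(n 4 ) cost comes from.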

Experiments and Results
We now present our experimental evaluation, which demonstrates the advantage of span-based parsing for compositional generalization. We compare to baseline models over two types of data splits: (a) IID split, where the training and test sets are sampled from the same distribution, and (b) compositional split, where the test set includes structures that are unseen at training time. Details on the experimental setup are given in Appendix A.

Datasets
We evaluate on the following datasets (Table 1).
Footnote 2: Corro (2020) show an O(n 3 ) algorithm for this type of non-projective structure.

york') and stateid('utah') are anonymized to STATE). We then split to train/development/test sets, such that all examples that share a template are assigned to the same set. We also verify that the sizes of these sets are as close as possible to the IID split.
For the compositional split, LENGTH, we sort the dataset by program token length and take the longest 280 examples to be the test set. We then randomly split the shortest 600 examples between the train and development set, where we take 10% of the 600 examples for the latter.
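The LENGTH split can be sketched as follows. The function name, the representation of an example as an (utterance, program) pair, the whitespace tokenization of programs, and the random seed are all assumptions made for illustration.

```python
import random


def length_split(examples, n_test=280, n_short=600, dev_frac=0.1, seed=0):
    """Split (utterance, program) pairs by program token length.

    Whitespace tokenization of the program is an assumption of this sketch.
    """
    ordered = sorted(examples, key=lambda ex: len(ex[1].split()))
    test = ordered[-n_test:]      # the longest programs form the test set
    short = ordered[:n_short]     # the shortest programs are used for training
    random.Random(seed).shuffle(short)
    n_dev = round(n_short * dev_frac)
    return short[n_dev:], short[:n_dev], test


# Synthetic data: example k has a program with k + 1 tokens.
examples = [(f"utterance {k}", " ".join(["tok"] * (k + 1))) for k in range(880)]
train, dev, test = length_split(examples)
```

With this construction every test program is at least as long as every training program, so the test set probes generalization to longer structures.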
CLEVR and CLOSURE CLEVR (Johnson et al., 2017) contains synthetic questions, created using 80 templates, over synthetic images with multiple objects of different shapes, colors, materials and sizes (example in Fig. 4 in the Appendix). The recent CLOSURE dataset (Bahdanau et al., 2019) includes seven new question templates that are created by combining referring expressions of various types from CLEVR in new ways.
We use the semantic parsing version of these datasets, where each image is described by a scene (knowledge-base) that holds the attributes and positional relations of all objects. We use programs in the DSL version from Mao et al. (2019).
For our experiments, we take 5K examples from the original CLEVR training set and treat them as our development set. We use the other 695K examples as training data for our baselines. Importantly, we only use 10K training examples for SPANBASEDSP to reduce training time. We then create an IID split where we test on the CLEVR original development set (test scenes are not publicly available). We additionally define the CLOSURE split, that tests compositional generalization, where we test on CLOSURE.
SCAN-SP SCAN (Lake and Baroni, 2018) contains natural language navigation commands that are mapped to action sequences (x and y in Fig. 5 in the Appendix). As SCAN lacks programs, we automatically translate the input to programs (z in Fig. 5) to create the semantic parsing version of SCAN, denoted SCAN-SP (more details are given in Appendix B). We experiment with the random SIMPLE split from Lake and Baroni (2018) as our IID split. We further use the primitive right (RIGHT) and primitive around right (AROUNDRIGHT) compositional splits from Loula et al. (2018). For each split we randomly assign 20% of the training set for development.

Baselines
SEQ2SEQ Similar to Finegan-Dollak et al. (2018), our baseline parser is a standard seq2seq model (Jia and Liang, 2016) that encodes the utterance x with a BiLSTM encoder over pre-trained GloVe (Pennington et al., 2014) or ELMO embeddings, and decodes the program with an attention-based LSTM decoder (Bahdanau et al., 2015) assisted by a copying mechanism for handling entities unseen during training time (Gu et al., 2016).
BERT2SEQ Same as SEQ2SEQ, but we replace the BiLSTM encoder with BERT-base, which is identical to the encoder of SPANBASEDSP.
GRAMMAR Grammar-based decoding has been shown to improve performance on IID splits (Krishnamurthy et al., 2017; Yin and Neubig, 2017). Because decoding is constrained by the grammar, the model outputs only valid programs, which could potentially improve performance on compositional splits. We use the grammar from Wong and Mooney (2007) for GEOQUERY, and write grammars for SCAN-SP and CLEVR + CLOSURE. The model architecture is identical to SEQ2SEQ.
BART We additionally experiment with BART-base (Lewis et al., 2020), a seq2seq model pre-trained as a denoising autoencoder.
END2END Semantic parsers generate a program that is executed to retrieve an answer. However, other end-to-end models directly predict the answer from the context without an executor, where the context can be an image (Hudson and Manning, 2018; Perez et al., 2018), a table (Herzig et al., 2020), etc. Because CLEVR and CLOSURE have a closed set of 28 possible answers and a short context (the scene), they are a good fit for end-to-end approaches. To check whether end-to-end models generalize compositionally, we implement the following model. We use BERT-base to encode the concatenation of the input x to a representation of all objects in the scene. Each scene object is represented by adding learned embeddings of all of its attributes: shape, material, size, color, and relative positional rank (from left to right, and from front to back). We fine-tune the model on the training set using cross-entropy loss, where the [CLS] token is used to predict the answer.

Main Results
(Dong and Lapata, 2016), and loses just 4 points on the compositional TEMPLATE split. On the LENGTH split, SPANBASEDSP yields an accuracy of 63.6, substantially outperforming all baselines by more than 37 accuracy points.
Our ablations show that the lexicon is crucial for GEOQUERY, which has a small training set. In this setting, learning the mapping from language phrases to predicates is challenging. Ablating non-projective parsing also hurts performance for GEOQUERY, and leads to a reduction of 2-6 points for all of the splits.

Decomposition Analysis
We now analyze whether trees learned by SPANBASEDSP are similar to gold trees. For this analysis we semi-automatically annotate our datasets with gold trees. We do this by manually creating a domain-specific lexicon for each dataset (extending the lexicon from §3.3), mapping domain constants to possible phrases in the input utterances. We then, for each example, traverse the program tree (rather than the span tree) bottom-up and annotate join and φ categories for spans in the utterance, aided by manually-written domain-specific rules. In cases where the annotation is ambiguous, e.g., examples with more than two instances of a specific domain constant, we do not produce a gold tree.
We manage to annotate 100%/94.9%/95.9% of the examples in SCAN-SP/GEOQUERY/CLEVR + CLOSURE respectively in this manner. We verify the correctness of our annotation by training SPANBASEDSP from our annotated gold trees (bottom part of Table 2). The results show that training from these "gold" trees leads to similar performance as training only from programs.
We then train SPANBASEDSP from gold programs, as explained in §3.3, and calculate F 1 test scores, comparing the predicted span trees to the gold ones. F 1 is computed between the two sets of labeled spans, taking into account both the spans and their categories, but excluding spans with the φ category that do not contribute to the semantics. Table 3 shows that for GEOQUERY the trees SPANBASEDSP predicts are similar to the gold

Limitations
Our approach assumes a one-to-one mapping between domain constants and their manifestation as phrases in language. This leads to strong results on compositional generalization, but hurts the flexibility that is sometimes necessary in semantic parsing. For example, in some cases predicates do not align explicitly to a phrase in the utterance or appear several times in the program but only once in the utterance (Berant et al., 2013;Pasupat and Liang, 2015). This is evident in text-to-SQL parsing, where an utterance such as "What is the minimum, and maximum age of all singers from France?" is mapped to SELECT min(age) , max(age) FROM singer WHERE country='France'. Here, the constant age is mentioned only once in language (but twice in the program), and country is not mentioned at all. Thus, our approach is more suitable for formalisms where there is tighter alignment between the natural and formal language.
In addition, while we handle a class of non-projective trees ( §3.4), there are other non-projective structures that SPANBASEDSP cannot parse. Extending CKY to support all structures from Corro (2020) leads to a time complexity of O(n 6 ), which might be impractical.

Related Work
Until the neural era, semantic parsers used a lexicon and composition rules to predict partial programs for spans and compose them until a full program is predicted, and typically scored with a log-linear model given features over the utterance and the program (Zettlemoyer and Collins, 2005;Liang et al., 2011). In this work, we use a similar compositional approach, but take advantage of powerful span representations based on modern neural architectures.
The most similar work to ours is by Pasupat et al. (2019), who presented a neural span-based semantic parser. While they focused on training using projective gold trees (having more supervision and less expressivity than seq2seq models) and testing on i.i.d examples, we handle non-projective trees, given only program supervision, rather than trees. More importantly, we show that this approach leads to dramatic gains in compositional generalization compared to autoregressive parsers.
In recent years, work on compositional generalization in semantic parsing mainly focused on the poor performance of parsers in compositional splits (Finegan-Dollak et al., 2018), creating new datasets that require compositional generalization (Keysers et al., 2020;Bahdanau et al., 2019), and proposing specialized architectures mainly for the SCAN task (Lake, 2019;Nye et al., 2020;Gordon et al., 2020;Gupta and Lewis, 2018). In this work we present a general-purpose architecture for semantic parsing that incorporates an inductive bias towards compositional generalization. Finally, concurrently to us, Shaw et al. (2020) induced a synchronous grammar over program and utterance pairs and used it to introduce a compositional bias, showing certain improvements over compositional splits.

Conclusion
Seq2seq models have become unprecedentedly popular in semantic parsing but struggle to generalize to unobserved structures. In this work, we show that our span-based parser, SPANBASEDSP, which precisely describes how meaning is composed over the input utterance, leads to dramatic improvements in compositional generalization. In future work, we plan to investigate ways to introduce the explicit compositional bias, inherent to SPANBASEDSP, directly into seq2seq models.
Figure 4: An example span tree from CLEVR, along with its utterance x and program z. Here, the type system is used in join nodes to deterministically invoke the predicates filter and scene where needed. Sub-programs are omitted due to space reasons.
x: Are there any shiny objects that have the same color as the matte block?
z: exist(filter(metal,relate_att_eq(color,filter(rubber,cube,scene()))))
Figure 5: An example span tree from SCAN-SP, along with its utterance x, program z and action sequence y. The category join is abbreviated to J.