A Corpus-based Syntactic Analysis of Two-termed Unlike Coordination

Coordination is a phenomenon of language that conjoins two or more terms or phrases using a coordinating conjunction. Although coordination has been explored extensively in the linguistics literature, the rules and constraints that govern its structure are still largely elusive and widely debated amongst linguists. This paper presents a study of two-termed unlike coordinations in particular, where the two conjuncts of the coordination phrase form valid constituents but have distinct categories. We conducted a syntactic analysis of the phrasal categories that can be conjoined in such unlike coordinations through a computational corpus-based approach, utilizing the Corpus of Contemporary American English (COCA) as the main data source, as well as the Penn Treebank (PTB). The results show that the two conjuncts within unlike coordinations display different properties based on their position, supporting an antisymmetric view of the structure of coordination. This research provides new data and perspectives through the use of statistical techniques that can help shape future theories and models of coordination.


Motivation
Coordination is a phenomenon of language that conjoins two or more terms or phrases. The terms or phrases that are grouped in coordination phrases are normally called conjuncts, and they are often conjoined by a coordinating conjunction, such as and, or, but, or nor. A common assumption in the linguistics literature is that two elements may only be coordinated if they share the same syntactic category, as in (1).
(1) a. [NP The chicken] and [NP the rice] go well together.
    b. The president will [VP understand the criticism] and [VP take action].
For example, in (1a), the two conjuncts being coordinated are "the chicken" and "the rice," which share the same syntactic category of noun phrase (NP). The assumption that the conjuncts of a coordination phrase will always have the same category is known as the Law of the Coordination of Likes (LCL) (Williams, 1981). The LCL explains why many instances of coordination are ungrammatical, such as the coordination of a prepositional phrase (PP) and a clause (CP) shown in (2) (Prażmowska, 2015).
(2) a. The scene of the movie was in Chicago.
    b. The scene that I wrote was in Chicago.
    c. * The scene [PP of the movie] and [CP that I wrote] was in Chicago.
Even though the prepositional phrase and the clause are both grammatical when standing alone within the context sentence, as in (2a) and (2b), their coordination in (2c) is ungrammatical, supposedly because of the LCL. However, several examples of syntactically unlike coordination can be found in English, such as the examples in (3) (Sag et al., 1985).
(3)
In the above examples, the two conjuncts within each coordination phrase do not share the same syntactic category. In these cases, the LCL seems to be too restrictive. Yet, there are also cases in which the LCL is not restrictive enough: a coordination phrase can still be ungrammatical even if its conjuncts have the same syntactic category (Prażmowska, 2015).
Example (4a) contains the coordination of two prepositional phrases, and (4b) contains the coordination of two adverbs. Despite the two conjuncts having like categories, these examples result in ungrammatical sentences. Semantics seems to play a role in the acceptability of coordinations as well; a stronger version of the LCL requires that conjuncts must also be alike in their semantic function. For example, in (4a), the first prepositional phrase "with his mother" expresses accompaniment, whereas the second "with good appetite" expresses manner (Prażmowska, 2015). However, identifying and articulating rigorous rules that predict all grammatical possibilities of coordination has been a difficult task for linguists, and as a result, the underlying syntactic structure of coordination phrases has been elusive.

Goal
The goal of this project is to explore and answer questions about the syntax of coordination phrases through a quantitative corpus analysis. By analyzing a large corpus of naturally-occurring spoken and written language using natural language processing and statistical techniques, we will investigate the patterns of syntactic categories found in unlike coordinations. An overarching goal for this project is to share data that may inform linguistic hypotheses about the underlying structure of coordination. By taking a computational approach, we can explore a larger and deeper set of questions regarding coordination, such as:
• What combinations of syntactic categories are attested in English data, and which appear most frequently?
• Does this depend on the genre of the text or the type of conjunction (and, or, but, nor)?
This paper begins by introducing the relevant problem background and related work. We then detail our corpus-based approach and implementation, which utilizes the Corpus of Contemporary American English (COCA), the Penn Treebank (PTB), and the Berkeley Neural Parser. We then follow with a presentation of the results and provide an in-depth discussion of the significant findings.

Background and Related Work
Capturing the structure of coordination has been a difficult problem in many theories of syntax. A flat, multi-headed structure was proposed in earlier theories, in which two or more lexical heads share the same phrase-level projection, as in the templates shown in Figure 1 (Progovac, 1998a; Chomsky, 1981). This theory captures the intuitive idea that the coordination of two NPs is an NP, that the coordination of two VPs is a VP, etc. An example of a two-termed coordination of NPs is provided in (5). We use CC as the name for the functional category of coordinating conjunctions, which is also the label used in the PTB.
In fact, the LCL was formulated due to this proposal for the syntax of coordination. Coordination was said to denote a relation between two (or more) elements that are "hierarchically equal" in that neither of the elements is more prominent than the other, leading to a symmetrical and flat vision of coordination structures (Prażmowska, 2015). Since conjuncts were assumed to be symmetrical and equal in status, it followed that they must share the same syntactic category to be grammatically coordinated.
One proposal that seems to address the existence of the unlike category coordinations seen in (6) is Bowers's Pred (predicate) functional category (Bowers, 1993). On top of the NPs, APs, and PPs being coordinated in these sentences, there is another level of structure. Bowers suggests that a null Pred head selects an NP, AP, or PP as its complement, forming a predicate phrase (PredP). Thus, unlike coordinations are actually like coordinations in disguise: all conjuncts have the category of PredP. PredPs are complements of the copula be in these sentences, as made apparent in (7).
However, Bowers's proposal does not account for cases where the coordinated strings are not predicates, such as in (8). In each of these examples, the coordination phrase is an adjunct of VP rather than a predicate complement of VP, and the conjuncts semantically serve the purpose of adverbial modification.
Other proposals dodge the problem of unlike coordination entirely by making the coordinating conjunction the head of its own coordination phrase (CCP). One example of such a theory is shown in (9). Here, conjuncts are specifiers and complements of the head conjunction (Johannessen, 1998; Zoerner, 1995). With such a construction, the categories of the conjuncts by themselves do not pose a restriction on the possibility of coordination. Thus, such theories do not have anything to say about the LCL, but they are still problematic in that they over-generate; no combinations of categories are prohibited.

Approach
We approached the task of capturing the structure of two-termed coordination by conducting a computational syntactic analysis on a large quantity of corpus data. Our primary data source is the Corpus of Contemporary American English (COCA) (Davies, 2015), and our additional data source is the Penn Treebank (PTB) augmented with Ficler and Goldberg's PTB coordination annotation extension (Ficler and Goldberg, 2016). We extracted coordination phrases from both of these datasets and performed a quantitative syntactic analysis using the constituency parses of the sentences within both texts. This approach has a few advantages over previous work. Much of the research that has shaped current theories of coordination has relied on the acceptability judgments of a few individuals, usually the author(s). By using corpus data, we gain an understanding of coordination on a much larger scale and emphasize empirical rather than intuitive judgments. We can also investigate differences in the patterns we identify based on the genre in which a coordination was found or the conjunction it contains.

Corpus Data
The Corpus of Contemporary American English (COCA) is a large, genre-balanced corpus of American English containing more than 450 million words of text (Davies, 2015). The COCA contains text from five genres: academic, fiction, magazine, newspaper, and spoken texts. The corpus includes 20 million words for each year from 1990 to 2012, divided evenly among the genres. A balanced corpus, especially one that includes spoken data, was important for this project, as there may be variations in the coordinations found across different genres.
In addition to COCA data, we use the Penn Treebank (PTB), a collection of 2,499 stories from the Wall Street Journal gathered over a three-year period (Marcus et al., 1993). Sentences from the PTB are already tokenized and annotated with phrase structure, unlike the COCA. However, coordination annotations in the PTB are often inconsistent, include errors, and lack internal structure in many cases. For this reason, we make use of Ficler and Goldberg's PTB coordination annotation extension, which improves the coordination annotation in the PTB (Ficler and Goldberg, 2016). This extension provides an annotation that explicitly marks coordination phrases and the role of each element in coordination structures (i.e., conjuncts, markers, connectives, and shared elements are all identified and marked).

Syntactic Analysis
The main task of our syntactic analysis involves the detection and extraction of coordination phrases from our corpus data. Since the COCA is provided in a raw text format, we use the Berkeley Neural Parser to produce syntax trees of sentences in the COCA. This is a state-of-the-art constituency parser that generates syntax trees in the style of the Penn Treebank (Kitaev and Klein, 2018). To implement a good search algorithm for coordinations within parsed COCA data, we studied several sentence parse trees containing coordinations and identified three patterns in the way that the Berkeley Neural Parser most often represents the structure of coordination phrases, as shown in Figure 2.
Since the PTB is already annotated as phrase structure trees, the possible problems of using a constituency parser on novel text are eliminated. The identification of coordination phrases is made much simpler here with the help of the coordination annotation extension. The explicit function markers allow for the straightforward detection and isolation of conjuncts and conjunctions from other tangential elements that may be contained within a coordination phrase, such as modifiers and connectives. Figure 3 shows an example of a PTB phrase structure tree with the extension's additional function marking.
Figure 2: Three patterns used to detect two-termed coordination phrases in parsed COCA data. X, Y, and Z may be any PTB constituent tags.
For our syntactic analysis, we include coordinations of six types of PTB phrasal category labels: noun phrases (NP), verb phrases (VP), prepositional phrases (PP), adjective phrases (ADJP), adverb phrases (ADVP), and subordinate clauses (SBAR, often called complementizer phrases (CP) in more recent syntax literature). We have chosen this set of labels because they correspond to the most frequent phrasal categories in the data. Once coordination phrases have been identified, we run statistical tests on the frequencies of their different attributes, such as the categories of the conjuncts, the type of conjunction used, and the genre in which the coordination was found.
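The detection step can be illustrated with a minimal sketch. It assumes only the simplest flat configuration, a node whose children are exactly a phrase, a CC, and a phrase; this is an illustrative assumption and is not claimed to match the three patterns of Figure 2 or the implementation used in this study. The helper names `parse_bracketed` and `find_flat_coordinations` are hypothetical, and the example sentence is constructed for illustration.

```python
from typing import List, Tuple, Union

Tree = Union[str, list]  # a leaf token, or [label, child, child, ...]

# The six phrasal categories included in the analysis.
CATEGORIES = {"NP", "VP", "PP", "ADJP", "ADVP", "SBAR"}


def parse_bracketed(text: str) -> list:
    """Parse a PTB-style bracketed string into nested [label, children...] lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Tree:
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # skip "("
            node = [tokens[pos]]          # node label
            pos += 1
            while tokens[pos] != ")":
                node.append(read())
            pos += 1                      # skip ")"
            return node
        tok = tokens[pos]                 # a word at a leaf
        pos += 1
        return tok

    return read()


def find_flat_coordinations(tree: Tree) -> List[Tuple[str, str, str]]:
    """Return (first category, conjunction, second category) for every node
    whose children are exactly: phrase, CC, phrase (an assumed flat pattern)."""
    found: List[Tuple[str, str, str]] = []
    if isinstance(tree, str):
        return found
    children = [c for c in tree[1:] if isinstance(c, list)]
    if (len(children) == 3
            and children[1][0] == "CC"
            and isinstance(children[1][1], str)
            and children[0][0] in CATEGORIES
            and children[2][0] in CATEGORIES):
        found.append((children[0][0], children[1][1].lower(), children[2][0]))
    for child in children:
        found.extend(find_flat_coordinations(child))
    return found


# Constructed example, not taken from the corpus.
sentence = ("(S (NP (PRP I)) (VP (VBP know) (UCP (NP (DT the) (NN answer)) "
            "(CC and) (SBAR (IN that) (S (NP (PRP it)) (VP (VBZ matters)))))))")
coords = find_flat_coordinations(parse_bracketed(sentence))
unlike = [c for c in coords if c[0] != c[2]]  # keep only unlike coordinations
print(unlike)  # [('NP', 'and', 'SBAR')]
```

Restricting matches to the six category labels mirrors the filtering described above; a fuller search would also need to handle the nested and modifier-bearing structures that a parser can produce.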

Results
In our analyses, we employ chi-square (χ²) tests, which determine whether a set of observed frequencies deviates significantly from a set of expected frequencies. We consider p-values less than 0.05 to be statistically significant. Since our sample sizes are very large, we conduct additional post-tests to accompany any statistically significant results. We use Cramer's V to measure strength of association (Table 1) (Akoglu, 2018).

Most Frequent Unlike Coordinations
We performed an analysis of the most frequent unlike coordinations in the COCA data. Figure 4 displays the top ten most common unlike coordinations found in all of the COCA data we parsed along with their relative frequencies, and Table 2 contains examples. We found a significant difference in the distribution of unlike category coordinations, with a moderate tendency toward the most common coordination combination, NP+SBAR, χ²(9, N = 24456) = 3142.0, p < .001, V = .119.
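As a worked illustration of the effect size, the sketch below computes a goodness-of-fit chi-square statistic and Cramer's V. It assumes equal expected frequencies across categories and the goodness-of-fit form V = sqrt(χ² / (N · df)); neither assumption is stated explicitly in this paper, although the formula does reproduce the reported V from the reported χ², N, and degrees of freedom. The function names are hypothetical.

```python
import math
from typing import Sequence


def chi_square_gof(observed: Sequence[float]) -> float:
    """Chi-square goodness-of-fit statistic against equal expected frequencies
    (the uniform expectation is an assumption of this sketch)."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


def cramers_v(chi2: float, n: int, df: int) -> float:
    """Cramer's V in its goodness-of-fit form: sqrt(chi2 / (n * df))."""
    return math.sqrt(chi2 / (n * df))


# Statistic, sample size, and degrees of freedom as reported for the COCA data.
v_coca = cramers_v(3142.0, 24456, 9)
print(round(v_coca, 3))  # 0.119
```

Because the chi-square statistic grows with sample size, a very small p-value is expected at N = 24456; dividing by N · df is what makes V comparable across samples of different sizes.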

By COCA Genre
We also performed an analysis of the most frequent unlike coordinations in each of the five COCA genres. In each genre, a significant difference was found in the distribution of unlike category coordinations. Appendix B contains figures displaying the top unlike coordinations in each genre. In the academic genre, there was a weak tendency toward the most common combination, NP+SBAR (Figure 7). In the fiction genre, a moderate tendency was found toward the most common combination, ADJP+VP (Figure 8). In the magazine genre, we also found a moderate tendency toward the most common combination, which was again NP+SBAR, as in the academic genre (Figure 9). In the newspaper genre, a moderate tendency toward the most common combination was found once again, with NP+VP being the most common combination (Figure 10). In the spoken genre, there was a strong tendency toward the most common combination, which was NP+SBAR, as in the academic and magazine genres (Figure 11).
Table 4: Summary of chi-square test and Cramer's V results for the frequency difference among the most common unlike coordinations based on the coordinating conjunction used to conjoin them (from COCA data).

By Conjunction
We also performed an analysis of the most frequent unlike category combinations based on the type of coordinating conjunction used to conjoin them. Table 4 summarizes the results of the chi-square tests and Cramer's V for each type of conjunction, and Appendix B again contains figures displaying the top unlike coordinations for each conjunction.
For the conjunctions and, or, and but, a significant difference was found in the distribution of unlike category coordinations. For unlike coordinations containing and, there was a moderate tendency toward the most common combination, which was NP+SBAR (Figure 12). For unlike coordinations containing or, we also found a moderate tendency toward the most common combination, which was again NP+SBAR (Figure 13). For unlike coordinations containing but, there was a weak tendency toward the most common combination, ADJP+VP (Figure 14). For unlike coordinations containing nor, no significant difference was found in the distribution of unlike category coordinations (Figure 15).

In the PTB
We performed an analysis of the most frequent unlike coordinations in the PTB as well. Figure 5 displays the top ten most common unlike coordinations in the PTB data, along with their relative frequencies. We found a significant difference in the distribution of unlike category coordinations with a moderate tendency toward the most common combinations, χ²(9, N = 216) = 22.981, p = .006, V = .109. The most common unlike coordination in the PTB was ADVP+PP.

Differences Between Conjunct Positions
In addition to the most frequent combinations of categories, we conducted an analysis of the categories for each conjunct independently. We first report the results based on frequencies from the COCA. Table 5 summarizes the results of the chi-square tests and Cramer's V for each of the six phrasal categories. For NPs, a very strong tendency was found toward the first conjunct position; for VPs, a very strong tendency was found toward the second conjunct position; for PPs, only a weak tendency was found toward the first conjunct position; for ADJPs, a moderate tendency was found toward the first conjunct position; for ADVPs, only a negligible tendency was found toward the second conjunct position; and for SBARs, a very strong tendency was found toward the second conjunct position.
Next, we report the results based on frequencies from the PTB. Table 6 summarizes the results of the chi-square tests and Cramer's V for each of the six phrasal categories. For NPs, a very strong tendency was found toward the first conjunct position; for VPs, PPs, ADJPs, and ADVPs, no significant difference was found in the distribution of conjunct positions; and for SBARs, a very strong tendency was found toward the second conjunct position.
Table 6: Summary of chi-square test and Cramer's V results for the frequency difference between the two conjunct positions for each type of phrasal category from the PTB data.
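The per-category position comparison can be sketched as follows. The sketch assumes that each category's first-position and second-position counts are tested against an even split (df = 1); this is our reading of the analysis rather than a specification given in the text. The function names and the toy pairs are hypothetical and are not corpus data.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def position_counts(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Tuple[int, int]]:
    """Map each category to (count as first conjunct, count as second conjunct)."""
    first: Counter = Counter()
    second: Counter = Counter()
    for left, right in pairs:
        first[left] += 1
        second[right] += 1
    return {cat: (first[cat], second[cat])
            for cat in sorted(set(first) | set(second))}


def position_chi_square(n_first: int, n_second: int) -> float:
    """Chi-square statistic (df = 1) against an assumed even split between positions."""
    expected = (n_first + n_second) / 2
    return ((n_first - expected) ** 2 + (n_second - expected) ** 2) / expected


# Toy (first, second) category pairs for illustration only.
toy_pairs = [("NP", "SBAR"), ("NP", "SBAR"), ("NP", "VP"),
             ("ADVP", "PP"), ("PP", "ADVP")]
counts = position_counts(toy_pairs)
print(counts["NP"])                        # (3, 0)
print(position_chi_square(*counts["NP"]))  # 3.0
```

Testing each category separately against its own total is what controls for the categories' different overall frequencies.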

Evaluation
A portion of the data we have presented in the previous section was gathered through the use of a constituency parser to identify coordination phrases. While the Berkeley Neural Parser is state-of-the-art, no parser is perfect, especially concerning coordination disambiguation. Furthermore, there are additional types of coordination structures that we do not consider, including non-constituent coordination and gapping. In non-constituent coordination, each conjunct in a coordination phrase does not form its own constituent under traditional theories of clause structure, as shown in example (10).
(10) The girl from California walked [into the room at 9 PM] and [out of the room at 10 PM].
Gapping is the phenomenon in which a phrase is coordinated with another phrase that seems to be missing some material, as shown in (11).

(11) [Mary ate beans] and [John potatoes].
While this paper only seeks to analyze the coordination of constituents and does not consider these additional types of coordination, they still pose challenges in the identification and labeling of coordination phrases by parsers. We carried out an evaluation in which human raters manually assessed a random sample of unlike coordinations to estimate an error rate for each type of category combination. Each type of unlike coordination was assigned a score based on the judgments of three independent raters. A single rater contributes to the score by providing the percentage of samples in which they agreed with the parser's labels. The overall score for that type of coordination is then assigned by taking the mean of the three raters' scores. The scores for each type of unlike coordination are enumerated in Table 7, along with the sample size, confidence level, and margin of error used for sampling.
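The scoring procedure described above amounts to a mean of per-rater agreement percentages. A minimal sketch follows, with hypothetical function names and made-up judgments that do not come from Table 7.

```python
from statistics import mean
from typing import Sequence


def rater_agreement(judgments: Sequence[bool]) -> float:
    """Percentage of sampled coordinations where one rater agreed with the parser."""
    return 100.0 * sum(judgments) / len(judgments)


def combination_score(ratings: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-rater agreement percentages for one category combination."""
    return mean(rater_agreement(r) for r in ratings)


# Made-up judgments from three raters over four sampled coordinations.
ratings = [
    [True, True, True, False],   # rater 1: 75%
    [True, True, False, False],  # rater 2: 50%
    [True, True, True, True],    # rater 3: 100%
]
print(combination_score(ratings))  # 75.0
```

Averaging percentages treats the three raters equally but does not measure how often they agree with one another; an inter-rater statistic would be a separate calculation.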

Most Frequent Unlike Coordinations
The results of the analysis of the most frequent unlike coordinations in the COCA data indicate that NP+SBAR is the most common unlike coordination. It was also the most frequent unlike coordination in three of the five genres (academic, magazine, and spoken). Some examples from the COCA are shown in (12).
One possible explanation for the high frequency of NP+SBAR coordinations is that subordinate clauses have very similar syntactic distributions to noun phrases in other contexts as well. In particular, subordinate clauses, which are called complementizer phrases (CP) in the syntax literature, can be the subjects of sentences. When a CP occupies the subject position of a sentence, it is called a sentential subject (Lohndal, 2014).
Some linguists have theorized that sentential subjects and more typical nominal subjects have the same syntactic category. Much like Bowers's predicate phrase analysis discussed in Section 2, sentential subjects may be analyzed as having a null determiner head that forms a determiner phrase (DP) from a CP (Lohndal, 2014).
The argument for clauses as DPs would posit that the same null determiner head that plays a role in the formation of plural DPs could also play a role in the formation of DPs from subordinate clauses. The data collected in this project provide more evidence through coordination that DPs and CPs have very similar syntactic distributions. While NP+SBAR was also within the top ten unlike coordinations in the PTB, ADVP+PP and PP+ADVP were the most common in the PTB. Examples from the PTB are presented in (15).
ADVP+PP and PP+ADVP were within the top coordinations from the COCA data as well. Their frequent co-occurrence likely has to do with ADVP's and PP's shared purpose of adverbial modification in adjunct position. A null functional morpheme could be used to explain this coordination, and this idea would be quite similar to Bowers's Pred (predicate) proposal but applied to adjuncts of verbs instead of complements.

Differences Between Conjunct Positions
When considering each phrasal category in isolation and controlling for their different total frequencies, in both the COCA and the PTB, NPs had a very strong tendency toward being in the first conjunct position, and SBARs had a very strong tendency toward the second conjunct position. In the COCA data, VPs had a very strong tendency toward the second conjunct position, and ADJPs had a moderate tendency toward the first conjunct position. It appears that phrasal categories that can be very short, like NPs, are more likely to appear as the first conjunct, whereas longer phrases, like CPs or VPs, are more likely to be the second conjunct. This may be related to a phenomenon called heavy NP shift, in which a noun phrase appears to the right of its expected canonical position due to its "weight" (Kayne, 1994, Chapter 7). Example (16) explores heavy NP shift through prepositional dative constructions, where the recipient of a ditransitive verb (in this case, "Jen") is the object of the preposition to (Colleman et al., 2010).
Shifting can also target syntactic categories other than noun phrases. In (18a), the complement and adjunct of the noun "statue" appear in their expected positions, with the complement [PP of him] closer to the noun. In (18b), the complement is heavier than the adjunct and thus appears further to the right. The main idea behind shifting can be applied to coordination and the trends that we observed in the results section regarding asymmetry in conjunct positions. If heavier constituents undergo shifting to appear after lighter constituents within phrases, this would explain why longer and more complex conjuncts tend to appear in the second conjunct position of coordination phrases. Example (19) shows this intuition through the like coordination of two NPs with different lengths.

Limitations
One shortcoming of this paper lies in the evaluation plan: the human reviewers were not blind to the labels given to coordination phrases by the parser. With more resources, a future iteration of this project could include the creation of a small gold standard dataset of coordinations and use the more formal precision, recall, and F1 metrics to gauge the parser's accuracy in the identification of coordinations. Still, the raters' evaluations reveal the limitations of an analysis that applies an existing constituency parser to raw COCA data, which introduces a number of parse errors. We acknowledge the drawbacks of such an approach and have therefore supplemented the analysis of COCA data with data from the Penn Treebank, which is not processed using a parser. These data sources together provide more concrete examples of the possibilities of unlike constituent coordination.

Conclusion
This paper approached the problem of understanding the syntax of two-termed coordination phrases through a computational corpus analysis. Previous research has not attempted a thorough analysis of coordination based on English corpora, instead relying on intuitive acceptability judgments to inform its theories. We conducted a syntactic analysis by extracting coordination phrases from the Corpus of Contemporary American English and the Penn Treebank, and we investigated the most common unlike coordinations and the syntactic categories that appeared in either of the two conjunct positions. Some of the findings from this project have interesting implications for coordination and syntax as a whole. The high frequency of coordinations of noun phrases with subordinate clauses provides further evidence that noun phrases and clauses share similar syntactic distributions and may be structurally defined as determiner phrases. The tendency for first conjuncts to be shorter constituents and second conjuncts to be longer ones might suggest that shifting occurs in coordination structures as well. One of the main takeaways from these results is that there are evident syntactic distinctions between the two conjuncts of a coordination phrase, which support theories that posit an antisymmetric account of the structure of coordination.

A Heatmap of Unlike Coordinations in COCA
For completeness, we include the frequency distribution of unlike coordinations for all 30 combinations of categories in the COCA data. Figure 6 visualizes these data in the form of a heatmap.
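The 30 combinations are the ordered pairs of distinct categories drawn from the six labels (6 × 5). A minimal sketch of tallying such pairs into the grid underlying a heatmap is given below; the function name and the toy pairs are hypothetical, and no plotting library is assumed.

```python
from collections import Counter
from itertools import permutations
from typing import Iterable, List, Optional, Tuple

CATEGORIES = ["NP", "VP", "PP", "ADJP", "ADVP", "SBAR"]

# Ordered pairs of distinct categories: 6 * 5 = 30 combinations.
COMBINATIONS = list(permutations(CATEGORIES, 2))


def unlike_matrix(pairs: Iterable[Tuple[str, str]]) -> List[List[Optional[int]]]:
    """Rows are first-conjunct categories, columns are second-conjunct categories.
    Diagonal cells (like coordinations) are left as None."""
    counts = Counter(p for p in pairs if p[0] != p[1])
    return [[None if a == b else counts[(a, b)] for b in CATEGORIES]
            for a in CATEGORIES]


# Toy pairs for illustration only.
matrix = unlike_matrix([("NP", "SBAR"), ("NP", "SBAR"), ("ADVP", "PP"), ("NP", "NP")])
print(len(COMBINATIONS))  # 30
print(matrix[0][5])       # 2  (NP first, SBAR second)
```

Keeping the two orders of each pair in separate cells preserves the positional asymmetry examined in the analysis of conjunct positions.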

B Top Unlike Coordinations by Genre and Conjunction
The figures in this appendix display the most frequent unlike category coordinations for each COCA genre and for each type of coordinating conjunction (and, or, but, nor) from the COCA data. Figures 7, 8, 9, 10, and 11 correspond to each of the five COCA genres, and the coordination frequencies are taken relative to all unlike coordinations within that genre. Figures 12, 13, 14, and 15 correspond to each of the four coordinating conjunctions, and the coordination frequencies are taken relative to all unlike coordinations that use the given conjunction.