Linguistic Issues in Language Technology, Volume 18, 2019 - Exploiting Parsed Corpora: Applications in Research, Pedagogy, and Processing
Abstract Meaning Representation (AMR) is a meaning representation framework in which the meaning of a full sentence is represented as a single-rooted, acyclic, directed graph. In this article, we describe an ongoing project to build a Chinese AMR (CAMR) corpus, which currently includes 10,149 sentences from the newsgroup and weblog portion of the Chinese TreeBank (CTB). We describe the annotation specifications for the CAMR corpus, which follow the annotation principles of English AMR but make adaptations where needed to accommodate the linguistic facts of Chinese. The CAMR specifications also include a systematic treatment of sentence-internal discourse relations. One significant change we have made to the AMR annotation methodology is the inclusion of the alignment between word tokens in the sentence and the concepts/relations in the CAMR annotation, which makes it easier for automatic parsers to model the correspondence between a sentence and its meaning representation. We develop an annotation tool for CAMR, and the inter-annotator agreement between the two annotators, as measured by the Smatch score, is 0.83, indicating reliable annotation. We also present some quantitative analysis of the CAMR corpus: 46.71% of the AMRs of the sentences are non-tree graphs, and the AMRs of 88.95% of the sentences contain concepts that are inferred from the context of the sentence but do not correspond to a specific word.
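The two quantities reported above, agreement as a triple-overlap F-score and the share of non-tree graphs, can be illustrated with a minimal sketch. It is not the authors' tool or the Smatch implementation: the function names `triple_f1` and `is_tree` are hypothetical, the example graphs are toy data, and the sketch assumes the variables of the two annotations are already aligned, whereas the actual Smatch metric searches over variable mappings to maximise the score.

```python
from collections import Counter
from typing import Set, Tuple

# A triple is (source, relation, target); names here are illustrative only.
Triple = Tuple[str, str, str]


def triple_f1(gold: Set[Triple], pred: Set[Triple]) -> float:
    """F1 over shared triples, assuming variables are already aligned.

    Real Smatch additionally searches for the best variable mapping;
    that search is deliberately omitted from this sketch.
    """
    matched = len(gold & pred)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


def is_tree(edges: Set[Tuple[str, str]]) -> bool:
    """True if no node has more than one incoming edge (no re-entrancy)."""
    incoming = Counter(target for _, target in edges)
    return all(count <= 1 for count in incoming.values())


# Toy annotations that differ in one concept label.
gold = {("w", "instance", "want-01"), ("b", "instance", "boy"), ("w", "ARG0", "b")}
pred = {("w", "instance", "want-01"), ("b", "instance", "girl"), ("w", "ARG0", "b")}
score = triple_f1(gold, pred)

# "The boy wants to go": the boy is an argument of both want-01 and go-01,
# so node b has two incoming edges and the graph is not a tree.
reentrant = is_tree({("w", "b"), ("w", "g"), ("g", "b")})
```

In the second example the shared argument is what makes the graph a non-tree, which is the kind of structure counted in the 46.71% figure.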
In this paper, we discuss constituent ordering generalizations in Japanese. Japanese has SOV as its basic order, but a significant range of argument order variations brought about by ‘scrambling’ is permitted. Although scrambling does not induce much in the way of semantic effects, it is conceivable that marked orders are derived from the unmarked order under some pragmatic or other motivations. The difference in the effect of basic and derived order is not reflected in native speakers’ grammaticality judgments, but we suggest that the intuition about the ordering of arguments may be attested in corpus data. Using the Keyaki treebank (a proper subset of which is the NINJAL Parsed Corpus of Modern Japanese (NPCMJ)), we show that naturally occurring corpus data confirm that marked orderings of arguments are less frequent than their unmarked counterparts. We suggest some possible motivations lying behind the argument order variations.
This paper presents a case study of the use of the NINJAL Parsed Corpus of Modern Japanese (NPCMJ) for syntactic research. NPCMJ is the first phrase structure-based treebank for Japanese that is specifically designed for application in linguistic (in addition to NLP) research. After discussing some basic methodological issues pertaining to the use of treebanks for theoretical linguistics research, we introduce our case study on the status of the Coordinate Structure Constraint (CSC) in Japanese, showing that NPCMJ enables us to easily retrieve examples that support one of the key claims of Kubota and Lee (2015): that the CSC should be viewed as a pragmatic, rather than a syntactic constraint. The corpus-based study we conducted moreover revealed a previously unnoticed tendency that was highly relevant for further clarifying the principles governing the empirical data in question. We conclude the paper by briefly discussing some further methodological issues brought up by our case study pertaining to the relationship between linguistic research and corpus development.
No matter how comprehensively corpus builders design their annotation schemes, users frequently find that information is missing that they need for their research. In this methodological paper I describe and illustrate five methods of adding linguistic information to corpora that have been morphosyntactically annotated (=parsed) in the style of Penn treebanks. Some of these methods involve manual operations; some are executed by CorpusSearch functions; some require a combination of manual and automated procedures. Which method is used depends almost entirely on the type of information to be added and the goals of the user. Of course, the main goal, regardless of method, is to record within the corpus additional information that can be used for analysis and also retained through further searches and data processing.
The principal barrier to the uptake of technologies in schools is not technological, but social and political. Teachers must be convinced of the pedagogical benefits of a particular curriculum before they will agree to learn the means to teach it. The teaching of formal grammar to first language students in schools is no exception to this rule. Over the last three decades, most schools in England have been legally required to teach grammatical subject knowledge, i.e. linguistic knowledge of grammar terms and structure, to children aged five and upwards as part of the national curriculum in English. A mandatory set of curriculum specifications for England and Wales was published in 2014, and elsewhere similar requirements were imposed. However, few current English school teachers were taught grammar themselves, and the dominant view has long been in favour of ‘real books’ rather than the teaching of a formal grammar. English grammar teaching thus faces multiple challenges: to convince teachers of the value of grammar in their own teaching, to teach the teachers the knowledge they need, and to develop relevant resources to use in the classroom. Alongside subject knowledge, teachers need pedagogical knowledge – how to teach grammar effectively and how to integrate this teaching into other kinds of language learning. The paper introduces the Englicious web platform for schools, and summarises its development and impact since publication. Englicious draws data from the fully-parsed British Component of the International Corpus of English, ICE-GB. The corpus offers plentiful examples of genuine natural language, speech and writing, with context and potentially audio playback. However, corpus examples may be age-inappropriate or over-complex, and without grammar training, teachers are insufficiently equipped to use them. In the absence of grammatical knowledge among teachers, it is insufficient simply to give teachers and children access to a corpus.
Whereas so-called ‘classroom concordancing’ approaches offer access to tools and encourage bottom-up learning, Englicious approaches the question of grammar teaching in a concept-driven, top-down way. It contains a modular series of professional development resources, lessons and exercises focused on each concept in turn, in which corpus examples are used extensively. Teachers must be able to discuss with a class why, for instance, work is a noun in a particular sentence, rather than merely report that it is. The paper describes the development of Englicious from secondary to primary, and outlines some of the practical challenges facing the design of this type of teaching resource. A key question, the ‘selection problem’, concerns how tools parameterise the selection of relevant examples for teaching purposes. Finally, we discuss curricula for teaching teachers and the evaluation of the effectiveness of the intervention.