Annotation and Classification of Light Verbs and Light Verb Variations in Mandarin Chinese

Light verbs pose a challenge in linguistics because of their syntactic and semantic versatility and because their distribution differs from that of regular verbs, which have higher semantic content and selectional restrictions. Because of their light grammatical content, earlier natural language processing (NLP) studies typically put light verbs in a stop word list and ignored them. Recently, however, the classification and identification of light verbs and light verb constructions have become a focus of study in computational linguistics, especially in the context of multi-word expressions, information retrieval, disambiguation, and parsing. Past linguistic and computational studies on light verbs have had very different foci. Linguistic studies tend to focus on the status of light verbs and their various selectional constraints, while NLP studies have focused on light verbs in the context of either a multi-word expression (MWE) or a construction to be identified, classified, or translated, trying to overcome the apparent poverty of semantic content of light verbs. There has been nearly no work attempting to bridge these two lines of research. This paper takes up this challenge by proposing a corpus-based study that classifies light verbs and captures the syntactic-semantic differences among them. In this study, we first incorporate results from past linguistic studies to create annotated light verb corpora with syntactic-semantic features. We next adopt a statistical method for automatic identification of light verbs based on these annotated corpora. Our results show that a language-resource-based methodology optimally incorporating linguistic information can resolve challenges posed by light verbs in NLP.


Introduction
Identification of Light Verb Constructions (LVCs) plays an important role and poses a special challenge in many Natural Language Processing (NLP) applications, e.g. information retrieval and machine translation. In addition to addressing LVCs as a contributing factor to errors in various applications, a few computational linguistics studies have targeted LVCs in English specifically (e.g., Tu and Roth, 2011; Nagy et al., 2013). To the best of our knowledge, however, there has been no computational linguistic study dealing with LVCs in Chinese specifically. It is important to note that, due to their lack of semantic content, light verbs can behave rather idiosyncratically in each language. Chinese LVCs, in particular, allow many different light verbs to share similar usage and to be interchangeable in some contexts. We should also note that light verbs in Chinese can take verbs, deverbal nouns, and eventive nouns as complements, while the morphological status of these categories is typically unmarked. Hence, it is often difficult to differentiate a light verb use from a non-light verb use of the same word without careful analysis of the data.
It has been observed that some Chinese light verbs can be used interchangeably but have different selectional restrictions in some (generally more limited) contexts. For example, the five light verbs congshi, gao, jiayi, jinxing, zuo (these words originally meant 'engage', 'do', 'inflict', 'proceed', 'do' respectively) can all take yanjiu 'to do research' as their complement and form an LVC. However, only the light verbs gao and jinxing can take bisai 'to play games' as a complement, whereas the other light verbs congshi, jiayi, and zuo cannot. Since light verbs are often interchangeable yet each has its own selectional restrictions, the identification of the light verbs themselves is both a challenging and a necessary task. It has also been observed that this kind of selectional versatility has led to variation among different variants of Mandarin Chinese, such as Mainland and Taiwan Mandarin. The versatility of Chinese light verbs makes the identification of LVCs more complicated in Chinese than in English. Therefore, studying the differences among light verbs and among variants of Chinese is important but challenging in both linguistic studies and computational applications. With annotated data from comparable corpora of Mainland and Taiwan Mandarin Chinese, this paper proposes both statistical and machine learning approaches to differentiate five of the most frequently used light verbs in both variants based on their syntactic and semantic features. The experimental results show that our approach can reliably differentiate the light verbs from each other in each variety of Mandarin Chinese.
Our work makes several contributions. Firstly, rather than focusing only on the two light verbs jiayi and jinxing, as in previous linguistic studies, we extend the study to more light verbs that are frequently used in Chinese. In fact, we will show that although jiayi and jinxing were often discussed as a pair in previous literature, the two are quite different from each other. Secondly, we show that statistical analysis and machine learning approaches are effective in identifying the differences among light verbs and the variations demonstrated by the same light verb in different variants of Chinese. Thirdly, we provide a corpus that covers all typical uses of Chinese light verbs. Finally, the feature set used in our study could potentially be used in the identification of Chinese LVCs in NLP applications.
This paper is organized as follows. Section 2 describes the data and their annotation. Section 3 applies both statistical and machine learning methods to classify the five light verbs in both Mainland and Taiwan Mandarin. Section 4 discusses the implications and applications of our methodology and findings. Section 5 presents the conclusion and our future work.

Data Collection
The data for this study are extracted from the Annotated Chinese Gigaword corpus (Huang, 2009), which is available from the LDC and contains over 1.1 billion Chinese words, with 700 million characters from the Taiwan Central News Agency and 400 million characters from the Mainland Xinhua News Agency.
The light verbs to be studied are congshi, gao, jiayi, jinxing, zuo; these five are among the most frequently used light verbs in Chinese (Diao, 2004). For each light verb, 400 sentences were randomly selected, half from the Mainland Gigaword subcorpus and the other half from the Taiwan Gigaword subcorpus, resulting in 2,000 sentences in total. The selection followed the principle that the sample should cover the different uses of each light verb.
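The sampling design described above can be sketched as follows. This is a minimal illustration only: the function name, the toy sentence pool, and the fixed random seed are our assumptions, and the sketch covers only the random draw, not the additional check that the sample covers the different uses of each light verb.

```python
import random

LIGHT_VERBS = ["congshi", "gao", "jiayi", "jinxing", "zuo"]
VARIANTS = ["Mainland", "Taiwan"]


def sample_sentences(pool, per_cell=200, seed=0):
    """Draw `per_cell` sentences without replacement for each (verb, variant) pair.

    `pool` maps (verb, variant) to a list of candidate sentences (hypothetical input).
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return {
        (verb, variant): rng.sample(pool[(verb, variant)], per_cell)
        for verb in LIGHT_VERBS
        for variant in VARIANTS
    }


# Toy pool of placeholder strings standing in for corpus sentences.
toy_pool = {
    (verb, variant): [f"{variant}-{verb}-{i}" for i in range(1000)]
    for verb in LIGHT_VERBS
    for variant in VARIANTS
}
selected = sample_sentences(toy_pool)
total = sum(len(sentences) for sentences in selected.values())
print(total)  # 5 verbs x 2 variants x 200 sentences
```

Sampling within each verb-variant cell, rather than from the pooled corpus, keeps the two subcorpora balanced for every light verb.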

Feature Annotation
Previous studies (Zhu, 1985; Zhou, 1987; Cai, 1982; Huang et al., 1995; Huang et al., 2013, among others) have proposed several syntactic and semantic features to identify the similarities and differences among light verbs, especially between the two most typical ones, i.e. jinxing (originally 'proceed') and jiayi (originally 'inflict'). For example, jinxing can take aspectual markers like zhe 'progressive marker', le 'aspect marker', and guo 'experiential aspect marker' while jiayi cannot (Zhou, 1987); congshi can take nominal phrases such as disan chanye 'the tertiary industry' as its complement while jiayi cannot. A few features have also been found to be variant-specific; for example, it has been reported that only congshi in Taiwan Mandarin, but not in Mainland Mandarin, can take informal and negative event complements like xingjiaoyi 'sexual trade'.
In our study, we selected 11 features that may help to differentiate the light verbs in each Mandarin variant as well as light verb variations between Mandarin variants, as in Table 1. All 2,000 examples collected for analysis were manually annotated based on the 11 features. The annotator is a trained expert in Chinese linguistics. Any ambiguous cases were discussed with two other experts in order to reach an agreement.

Identification of light verbs based on annotated corpora
In this section, we adopted both statistical analysis and machine learning approaches to identify the five light verbs (jiayi, jinxing, congshi, gao and zuo) on the corpora with 2,000 annotated examples.
The results of all approaches show that the five light verbs can be differentiated from each other in both Mainland and Taiwan Mandarin.

Identifying light verbs by statistical analysis
Both univariate analysis and multivariate analysis were used in our study for the identification. The tool we used is the Polytomous Package in R (Arppe, 2008).

Univariate analysis
Among the 11 independent features, one was found to have only one level in both the Mainland and Taiwan variants, i.e. all five light verbs in the two variants show the same preference for the feature, and it was thus excluded from the analysis. The feature is OTHERLV (no light verb co-occurs with another light verb in a sentence). Chi-squared tests were conducted to assess the significance of the co-occurrence of the remaining ten features with individual light verbs in both the Mainland and Taiwan variants. The chisq.posthoc() function in the Polytomous Package (Arppe, 2008) in R was used for the tests. The results are presented in Table 2, where the "+" and "-" signs indicate respectively a statistically significant overuse and underuse of a light verb with a feature, and "0" refers to a lack of statistical significance.
Table 2: Columns are Feature, N, and the five light verbs (congshi, gao, jiayi, jinxing, zuo) for Mainland Mandarin and for Taiwan Mandarin; the first row is POS.N (N = 585).
Table 2 suggests that in both Mainland and Taiwan Mandarin, each light verb shows a significant preference for certain features, and thus the light verbs can be distinguished from each other. For example, in Mainland Mandarin, although both congshi and gao show a significant preference for the features POS.N and ACCOMPEVT.no, congshi differs from gao in that it also significantly prefers DUREVT.yes (taking complements denoting durative events, e.g., yanjiu 'to research'), EVECOMP.no (event complements do not occur in subject position), and INTEREVT.no (not taking complements denoting events involving interaction among participants, e.g., taolun 'to discuss'), whereas gao shows either a dispreference or no significant preference for these features. Take gao and zuo in Taiwan Mandarin as another example. While both light verbs literally mean 'to do', there is no single feature preferred by both: gao prefers POS.N, ARGSTR.zero, FOREVT.yes, INTEREVT.no, and ACCOMPEVT.no, whereas zuo shows significant preferences for POS.V, ARGSTR.two, ASP.le, and PSYEVT.yes.
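To illustrate how "+", "-", and "0" signs of this kind can be derived from a contingency table, the following sketch computes a chi-squared statistic and marks each cell by its adjusted standardized residual. It is not a re-implementation of chisq.posthoc(): the residual-based criterion, the 1.96 cut-off, the function name, and the counts are all assumptions made for illustration and are not taken from Table 2.

```python
import math


def chi_square_posthoc(table, critical=1.96):
    """Return the chi-squared statistic and a +/-/0 sign for every cell.

    table[i][j] is the count of light verb i with feature level j.
    Assumes every row and column total is positive and below the grand total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    signs = []
    for i, row in enumerate(table):
        sign_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
            # Adjusted standardized residual: roughly N(0, 1) under independence.
            spread = math.sqrt(
                expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            )
            residual = (observed - expected) / spread
            if residual > critical:
                sign_row.append("+")   # significant overuse
            elif residual < -critical:
                sign_row.append("-")   # significant underuse
            else:
                sign_row.append("0")   # no significant deviation
        signs.append(sign_row)
    return chi2, signs


# Invented counts: rows are three light verbs, columns are POS.N and POS.V.
toy_counts = [
    [150, 50],   # hypothetical verb A
    [120, 80],   # hypothetical verb B
    [10, 190],   # hypothetical verb C
]
chi2, signs = chi_square_posthoc(toy_counts)
print(round(chi2, 1), signs)
```

The overall statistic says whether verb and feature are associated at all; the cell-wise signs say which verb-feature combinations drive the association.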

Multivariate analysis
As shown in Table 2, in both Mainland and Taiwan Mandarin, some of the five light verbs share some features, which explains why they can sometimes be used interchangeably. This also indicates (a) that a single feature is unlikely to be preferred by only one light verb and thus unlikely to differentiate that verb from the others on its own; and (b) that a given context may allow the occurrence of more than one light verb. For this reason, a multivariate analysis was adopted to better classify the five light verbs in each variant. The multivariate analysis used in the current study is polytomous logistic regression (Arppe, 2008), and the tool we used is the polytomous() function in the Polytomous Package (Arppe, 2008). The results of the multivariate analysis are summarized in Table 3. The numbers shown in the table are the odds for the features in favor of or against the occurrence of each light verb. When the estimated odds are larger than 1, the chance of the occurrence of a light verb is significantly increased by the feature, e.g., the chance of Mainland jiayi occurring is significantly increased by ARGSTRtwo (76.47:1), followed by ACCOMPEVTyes (56:1), VOCOMPyes (23.54:1), and PSYEVTyes (19.87:1). When the estimated odds are smaller than 1, the chance of the occurrence of a light verb is significantly decreased by the feature, e.g., the chance of Mainland jinxing occurring is significantly decreased by ACCOMPEVTyes (0.1849:1). In addition, "inf" and "1/inf" refer to odds larger than 10,000 and smaller than 1/10,000 respectively, whereas non-significant odds (p-value ≥ 0.05) are given in parentheses. As shown in Table 3, each of the light verbs in each Mandarin variant is favored by certain features and disfavored by others.
Take Mainland Mandarin for example: although congshi has no feature significantly in its favor, it is significantly disfavored by ARGSTRtwo (0.27:1) and INTEREVTyes (0.03:1); gao is disfavored by the aggregate of default variable values (0.02:1) and by ACCOMPEVTyes (0.1:1), but is significantly favored by ARGSTRtwo and ARGSTRzero; the chance of jiayi's occurrence is significantly increased by ARGSTRtwo (76.47:1), ACCOMPEVTyes (56.25:1), VOCOMPyes (23.54:1), and PSYEVTyes (19.87:1); jinxing has INTEREVTyes and EVECOMPyes in its favor, but ACCOMPEVTyes in its disfavor; no feature is significantly in favor of zuo, but this light verb is significantly disfavored by ARGSTRtwo, ARGSTRzero, FOREVTyes and INTEREVTyes.
The results in Table 3 also show that a single key feature is sometimes able to distinguish two light verbs from each other, although not all five. Take Mainland Mandarin again for example. Most pairs of light verbs among the five can be effectively differentiated by one feature. For instance, the feature ARGSTRtwo can differentiate congshi/gao, congshi/jiayi, jiayi/zuo and gao/zuo; the feature INTEREVTyes can differentiate congshi/jinxing and jinxing/zuo; the feature ACCOMPEVTyes can differentiate the pairs gao/jiayi and jinxing/jiayi.
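As a reading aid for odds of this kind, the sketch below converts a logistic regression coefficient (a log-odds value) into odds and prints it using the conventions described for Table 3. The function name, the 0.05 significance level, and the p-values passed in the example calls are assumptions for illustration; only the odds 76.47 and 0.1849 and the 10,000 cut-off come from the text.

```python
import math


def format_odds(log_odds, p_value, alpha=0.05, cap=10_000):
    """Render a log-odds estimate in the style described for Table 3."""
    odds = math.exp(log_odds)  # odds = e^(coefficient)
    if odds > cap:
        text = "inf"           # odds larger than 10,000
    elif odds < 1 / cap:
        text = "1/inf"         # odds smaller than 1/10,000
    else:
        text = f"{odds:.4g}"
    # Estimates that are not significant at level alpha go in parentheses.
    return text if p_value < alpha else f"({text})"


# Odds above 1 favor the light verb; odds below 1 disfavor it.
print(format_odds(math.log(76.47), p_value=0.001))   # a favoring feature
print(format_odds(math.log(0.1849), p_value=0.01))   # a disfavoring feature
print(format_odds(12.0, p_value=0.2))                # very large but not significant
print(format_odds(-12.0, p_value=0.001))             # very small and significant
```

Because odds are multiplicative, odds of 76.47:1 and 0.1849:1 are not symmetric around zero but around 1, which is why the text treats 1 as the boundary between favoring and disfavoring.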

Identifying light verbs by classification
In this section, we use machine learning techniques to study the same issue. Different classifiers were adopted to discriminate the five light verbs with the annotated corpora: ID3, Logistic Regression, Naïve Bayes and SVM, as implemented in WEKA (Hall et al., 2009). 10-fold cross-validation was performed separately on the Taiwan and Mainland corpora.
The results are presented in Table 4. We can see that different classifiers provide similar results on both corpora, which suggests that the classification results are reliable and that the features we annotated are effective in identifying the five light verbs. Overall, ID3 outperforms SVM slightly, with Logistic Regression and Naïve Bayes not far behind. ID3 may perform best because the data are low-dimensional. The detailed results, including precision, recall and F-measure, obtained by ID3 on both corpora are shown in Table 5. The corresponding confusion matrices are presented in Table 6. The confusion matrices suggest two very important generalizations: (a) all five verbs can be classified with good confidence, and (b) the overall classification patterns of Mainland and Taiwan Mandarin are very similar, which is consistent with the fact that Mainland and Taiwan Mandarin are two variants of the same language. However, we also observe that the confusion between particular light verb pairs may differ between Mainland and Taiwan Mandarin. This is the difference we would like to explore in the next section in order to propose a way to automatically predict these two variants. In addition, it is worth noting that all classifiers identify jiayi more effectively than the other light verbs, which suggests that the usage of jiayi may differ from that of the others.
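For readers less familiar with these measures, the following sketch shows how per-class precision, recall and F-measure follow from a confusion matrix. The matrix is invented for illustration and does not reproduce Table 6; the function and variable names are likewise our own.

```python
def per_class_scores(confusion, labels):
    """Compute (precision, recall, F-measure) for each class.

    confusion[i][j] is the number of instances of gold class i
    that the classifier predicted as class j.
    """
    scores = {}
    for k, label in enumerate(labels):
        true_positive = confusion[k][k]
        predicted = sum(row[k] for row in confusion)  # column total
        gold = sum(confusion[k])                      # row total
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / gold if gold else 0.0
        denominator = precision + recall
        f_measure = 2 * precision * recall / denominator if denominator else 0.0
        scores[label] = (precision, recall, f_measure)
    return scores


labels = ["congshi", "gao", "jiayi", "jinxing", "zuo"]
# Invented counts (200 gold instances per verb), NOT the values of Table 6.
toy_confusion = [
    [120, 30, 0, 30, 20],
    [25, 130, 5, 20, 20],
    [0, 5, 190, 5, 0],
    [30, 20, 5, 125, 20],
    [20, 25, 0, 25, 130],
]
scores = per_class_scores(toy_confusion, labels)
correct = sum(toy_confusion[k][k] for k in range(len(labels)))
accuracy = correct / sum(sum(row) for row in toy_confusion)
for label in labels:
    print(label, [round(value, 3) for value in scores[label]])
print("accuracy", accuracy)
```

Off-diagonal cells show which light verbs are confused with each other, which is the information used when comparing the Mainland and Taiwan matrices.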

Identifying light verbs by automatic clustering
We further used a clustering algorithm to test the differentiability of the five light verbs in both Mainland and Taiwan Mandarin. The results of the simple K-Means clustering algorithm on the Taiwan and Mainland corpora are shown in Table 7. The results show that the light verb jiayi behaves quite differently from the other four light verbs in both the Mainland and Taiwan corpora, which is similar to the analysis based on statistical methods in Section 3.1 and classification methods in Section 3.2. In both corpora, jiayi has a narrower usage than the other light verbs. Meanwhile, we can also find a cluster that is mainly formed by instances of jiayi from the Mainland corpus (i.e. cluster 0). After closer examination of the examples in this cluster, we found that it mainly includes sentences where jiayi takes complements denoting accomplishment events, e.g. gaizheng 'to correct' and jiejue 'to solve'. However, jiayi in the Taiwan corpus mainly takes complements denoting activity events, and thus almost all instances of Taiwan jiayi are mixed with those of the other light verbs. Our results also show a tendency for all other light verbs (jinxing, congshi, zuo, and gao) to take mostly activity complements and fewer accomplishment complements in both the Taiwan and Mainland corpora. More discussion of the light verb variations between Mainland and Taiwan Mandarin can be found in Huang et al. (2014).
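The idea behind the clustering experiment can be sketched with a small K-Means implementation over binary feature vectors. This is a simplified illustration, not WEKA's implementation: the farthest-first initialization is our choice to make the run deterministic, and the three-dimensional toy vectors (standing for hypothetical indicators such as ACCOMPEVT.yes, POS.V and ARGSTR.two) are invented.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centroids(points, k):
    """Deterministic initialization: start at the first point, then
    repeatedly add the point farthest from all chosen centroids."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def k_means(points, k, max_iter=100):
    centroids = farthest_first_centroids(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Invented instances: the first four mimic accomplishment-taking uses,
# the last four mimic activity-taking uses.
points = [
    [1, 1, 1], [1, 1, 1], [1, 1, 0], [1, 1, 1],
    [0, 0, 0], [0, 0, 1], [0, 0, 0], [0, 1, 0],
]
assignment, centroids = k_means(points, k=2)
print(assignment)
```

When instances of one light verb share a distinctive feature profile, they end up in a cluster of their own, which is the pattern reported for Mainland jiayi.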

Implications for Future Studies
In the study above, we annotated a corpus with all the types of significant context and, based on this annotated corpus, used a statistical model to differentiate the use of different light verbs in different contexts. Such a module of generic linguistic tools can have several potentially very useful applications. First, in translation, LVCs are among the most difficult constructions because there is less grammatical or contextual information available for choosing the correct translation. Our approach is especially promising here: as we encode contextual selection information for all light verbs, the same approach can be applied to the other language in the source-target pair to produce an optimal pairing. Second, in information extraction, the selection of different light verbs often conveys subtle differences in meaning. Our ability to differentiate similar light verbs in the same context could have great potential for extracting such subtle changes or increases in information. Lastly, in second language learning as well as error detection, light verbs have been one of the most challenging categories. Our study can be readily applied to either error detection or second language learning environments to provide the correct context in which a certain light verb is preferred over another.

From light verb variations to variants for the same language
One of the biggest challenges in the computational processing of languages is probably the identification of newly emergent variants, such as the cross-strait variants of Mandarin Chinese. For these two variants, the most commonly cited differences are lexical. Systematic grammatical differences are much more difficult to study and hence rarely reported (cf. Huang et al., 2009). As these are two newly divergent variants, their grammars are almost identical, except for some subtle differences, such as the selection between different light verbs and their complements. Our preliminary results from the univariate and multivariate analyses can be found in Tables 2 and 3. They show not only the similarities and differences among the light verbs in each variety (e.g., both ML and TW congshi and gao show a preference for POS.N, whereas both ML and TW jiayi show a dispreference), but also the similarities and differences between the corresponding light verbs in Mainland and Taiwan Mandarin. For instance, jinxing in TW tends to take VO compounds as its complements, e.g., jinxing toupiao "cast a vote", which is consistent with previous analysis (see more in Huang et al., 2014). It should be pointed out, however, that the difference is more often between a significant and a non-significant feature than between a significantly positive and a significantly negative feature.

Conclusion
In this paper, we addressed the issue of automatic classification of Chinese light verbs based on their usage distribution, using an annotated corpus that marks relevant contextual information for light verbs. We used both statistical methods and machine learning techniques to address this issue. We found that our approaches are effective in identifying light verbs and their variations. The automatically generated semantic and syntactic features can also be used in future studies on other light verbs as well as other lexical categories. The results suggest that richly annotated language resources paired with appropriate tools can lead to effective general solutions for some common issues faced by linguistics and natural language processing.