Automatic Food Categorization from Large Unlabeled Corpora and Its Impact on Relation Extraction

We present a weakly-supervised induction method to assign semantic information to food items. We consider two categorization tasks: food-type classification and the distinction of whether a food item is composite or not. The categorizations are induced by a graph-based algorithm applied to a large unlabeled domain-specific corpus. We show that the usage of a domain-specific corpus is vital. We not only outperform a manually designed open-domain ontology but also prove the usefulness of these categorizations in relation extraction, outperforming state-of-the-art features that include syntactic information and Brown clustering.


Introduction
In view of the broad interest in food in many parts of the population and the ever-increasing number of new dishes and food items, there is a need for automatic knowledge acquisition. We approach this task with the help of natural language processing.
We investigate different methods to assign categories to food items. We focus on two categorizations: a classification of food items into categories of the Food Guide Pyramid (U.S. Department of Agriculture, 1992) and a categorization of whether a food item is composite or not.
We present a semi-supervised graph-based approach to induce these food categorizations from an unlabeled domain-specific text corpus crawled from the Web. The method only requires minimal manual guidance for the initialization of the algorithm with seed terms. It depends, however, on an automatically constructed high-quality similarity graph. For that we choose a pattern-based representation that outperforms a distributional representation. For initialization, we examine both manually compiled seed words and a very small set of simple surface patterns with which such seeds can be induced automatically. As a hard baseline, we compare the effectiveness of using a general-purpose ontology for the same types of categorizations. Apart from an intrinsic evaluation, we also examine the categories in relation extraction.
The contributions of this paper are a method requiring minimal supervision for a comprehensive classification of food items and a proof of concept that the knowledge that can thus be gained is beneficial for relation extraction. Even though we focus on a specific domain, the induction method can be easily translated to other domains. In particular, other life-style domains, such as fashion, cosmetics or home & gardening, show parallels since comparable textual web data are available and similar relation types (e.g. that two items fit together or can be substituted by each other) exist.
Our experiments are carried out on German data but our findings should carry over to other languages since the issues we address are (mostly) language universal. For general accessibility, all examples are given as English translations.

Domain-Specific Text Corpus
In order to generate a dataset for our experiments, we used a crawl of chefkoch.de (Wiegand et al., 2012b) consisting of 418,558 webpages of food-related forum entries. chefkoch.de is the largest German web portal for food-related issues.

Food Categorization
As a food vocabulary, we employ a list of 1888 food items: 1104 items were directly extracted from GermaNet (Hamp and Feldweg, 1997), the German version of WordNet (Miller et al., 1990). The items were identified by extracting all hyponyms of the synset Nahrung (English: food).

Task I: Food Types
The food type categories we chose are mainly inspired by the Food Guide Pyramid (U.S. Department of Agriculture, 1992) that divides food items into categories with similar nutritional properties. This categorization scheme not only divides the set of food items into many intuitive homogeneous classes but it is also the scheme that is most commonly agreed upon. Table 1 lists the specific categories we use. For category assignment of complex dishes comprising different food items we applied a heuristic: we always assign the category that dominates the dish. A meat sauce, for example, would thus be assigned MEAT (even though it may contain ingredients other than meat).

Task II: Dishes vs. Atomic Food Items
In addition to Task I, we include another categorization that divides food items into dishes and atomic food items.

Graph-based Induction
We propose a semi-supervised graph-based approach to label food items with their respective food categories. The underlying data structure is a similarity graph connecting different food items. Food items that belong to the same category should be connected by highly weighted edges. In order to infer the labels for each respective food item, one first needs to specify a small set of seeds for each category and then apply a graph-based clustering method that divides the graph into clusters that represent distinct food categories. Our method is a low-resource approach that can also be easily adapted to other domains. The only domain-specific resources required are an unlabeled corpus and a set of seeds.

Construction of the Similarity Graph
To enable a graph-based induction, we generate a similarity graph that connects similar food items. For that purpose, a list of domain-independent similarity patterns was compiled. Each pattern is a lexical sequence that connects the mentions of two food items, i.e. food item1 (or | or rather | instead of | "(") food item2 (Table 3). Each pair of food items observed with any of those patterns is connected via a weighted edge (the different patterns are treated equally). The weight is the total frequency of all patterns co-occurring with a particular food pair. Since our patterns are high-precision but sparse, with one or a few prototypical seeds we cannot expect to find all items of a food category within the set of items to which the seeds are directly connected. Instead, one also needs to consider transitive connectedness within the graph. For example, in Figure 1 banana and redberry are not directly connected but they can be reached via pear or raspberry. However, by considering mediate relationships it becomes more difficult to determine the most appropriate category for each food item since most food items are connected to food items of different categories (in Figure 1, there are not only edges between banana and other types of fruits but also an edge to a sweet, i.e. chocolate). For a unique class assignment, we apply a robust graph-based clustering algorithm. (It will figure out that banana, pear, raspberry and redberry belong to the same category and that chocolate belongs to another category, since chocolate is mostly linked to food items that are not fruits.)
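The pattern-based graph construction can be sketched as follows. The English patterns, vocabulary and toy corpus below are illustrative stand-ins (the paper's actual patterns are German and listed in Table 3):

```python
import re
from collections import defaultdict

# Illustrative stand-ins for the German similarity patterns of Table 3
# and for the food vocabulary; not the paper's actual resources.
PATTERNS = [" or ", " or rather ", " instead of "]
FOODS = {"banana", "pear", "raspberry", "chocolate"}

def build_similarity_graph(sentences):
    """Edge weight = total frequency of any similarity pattern
    connecting a pair of food items (patterns treated equally)."""
    edges = defaultdict(int)
    for sent in sentences:
        for pat in PATTERNS:
            # find "food1 <pattern> food2" mentions
            for left, right in re.findall(r"(\w+)" + pat + r"(\w+)", sent):
                if left in FOODS and right in FOODS and left != right:
                    edges[frozenset((left, right))] += 1
    return edges

corpus = ["I would take banana or pear for this.",
          "Use raspberry instead of pear.",
          "banana or chocolate works too."]
graph = build_similarity_graph(corpus)
```

On the toy corpus this yields three weighted edges, e.g. one connecting banana and pear; on real data the weights accumulate over many forum posts.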

Semi-Supervised Graph Optimization
Our semi-supervised graph optimization (Belkin and Niyogi, 2004) is a robust algorithm that was primarily chosen since it only contains few free parameters to adjust. It is based on two principles: First, similar data points should be assigned similar labels, as expressed by a similarity graph of labeled and unlabeled data. Second, for labeled data points, the prediction of the learnt classifier should be consistent with the (actual) gold labels. We construct a weighted transition matrix W of the graph by normalizing the matrix of co-occurrence counts C which we obtain from the similarity graph (§3.1.1). We use the common normalization by a power of the degree function and set W_ii = 0. The normalization weight λ is the first of two parameters used in our experiments for semi-supervised graph optimization. For learning the semi-supervised classifier, we use the method of Zhou et al. (2004) to find a classifying function which is sufficiently smooth with respect to the structure of both unlabeled and labeled points. We are given a set of data points X = {x_1, ..., x_n} and a label set L = {1, ..., c}, with x_i (1 ≤ i ≤ l) labeled as y_i ∈ L and x_i (l+1 ≤ i ≤ n) unlabeled. For prediction, a vectorial function F : X → R^c is estimated, assigning a vector F_i of label scores to every x_i. The predicted labeling follows from these scores as ŷ_i = argmax_{j≤c} F_ij. Correspondingly, the gold labeling matrix Y is an n × c matrix with Y_ij = 1 if x_i is labeled as y_i = j and Y_ij = 0 otherwise.
Minimizing the cost function Q aims at a tradeoff between information from neighbours and initial labeling information, controlled by parameter  µ (the second parameter used in our experiments): The first term in Q is the smoothness constraint, its minimization leads to adjacent edges having similar labels. The second term is the fitting constraint, its minimization leads to consistency of the function F with the labeling of the data. The solution to the above cost function is found by solving a system of linear equations (Zhou et al., 2004).
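The cost function Q itself does not survive in this text; restoring it from Zhou et al. (2004), in the notation above (C the co-occurrence matrix, D its diagonal degree matrix; for λ = 0.5 the normalized W coincides with Zhou's S = D^{-1/2} C D^{-1/2}), it reads:

```latex
Q(F) \;=\; \frac{1}{2}\left(
  \sum_{i,j=1}^{n} C_{ij}
    \left\| \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \right\|^2
  \;+\; \mu \sum_{i=1}^{n} \left\| F_i - Y_i \right\|^2 \right)
```

Its minimizer has the closed form F* = (1 − α)(I − αS)^{-1} Y with α = 1/(1+µ), which is the system of linear equations referred to in the text; the scaling factor (1 − α) does not affect the arg max.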
As we do not possess development data for this work, we set the two free parameters λ = 0.5 and µ = 0.01. This setting is used for both induction tasks and all configurations. It is a setting that provided reasonable results without any notable bias for any particular configuration we examine.
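As a concrete sketch, the closed-form solution can be computed with a few lines of linear algebra, using the paper's parameter values λ = 0.5 and µ = 0.01; the 6-node toy graph and seeds below are illustrative, not the paper's data:

```python
import numpy as np

# Sketch of the closed-form label propagation of Zhou et al. (2004) with
# the paper's settings lambda = 0.5 and mu = 0.01. The toy graph (two
# weight-2 triangles joined by a weight-1 bridge) is illustrative only.
lam, mu = 0.5, 0.01

C = np.zeros((6, 6))
for i, j, w in [(0, 1, 2), (0, 2, 2), (1, 2, 2),   # "fruit" cluster
                (3, 4, 2), (3, 5, 2), (4, 5, 2),   # "sweet" cluster
                (2, 3, 1)]:                        # weak bridge
    C[i, j] = C[j, i] = w

d = C.sum(axis=1)
W = C / (d[:, None] ** lam * d[None, :] ** lam)  # degree normalization
np.fill_diagonal(W, 0.0)                         # W_ii = 0

Y = np.zeros((6, 2))
Y[0, 0] = 1.0   # seed: node 0 -> class 0
Y[5, 1] = 1.0   # seed: node 5 -> class 1

alpha = 1.0 / (1.0 + mu)
F = np.linalg.solve(np.eye(6) - alpha * W, Y)    # system of linear equations
labels = F.argmax(axis=1)
```

The two seeds propagate through the graph so that each triangle ends up with the label of the seed it contains.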

Manually vs. Automatically Extracted Seeds
We explore two types of seed initialization: (a) a manually compiled list of seed food items and (b) a small set of patterns (Table 4) with the help of which such seeds are automatically extracted. In order to extract seeds for Task I with the pattern-based approach, we apply the patterns from Hearst (1992). These patterns have been designed for the acquisition of hyponyms, and Task I can also be regarded as a type of hyponym extraction: the food types (fruit, meat, sweets) represent the hypernyms for which we extract seed hyponyms (banana, beef, chocolate).
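The Hearst-style seed extraction can be sketched as follows; the English pattern "&lt;hypernym&gt; such as &lt;hyponym&gt;" and the toy sentences are illustrative stand-ins for the paper's German patterns (Table 4):

```python
import re
from collections import Counter

# Illustrative stand-in for one Hearst (1992) pattern; the paper uses
# German patterns and hypernyms such as fruit, meat and sweets.
HYPERNYMS = {"fruit", "meat"}

def extract_seeds(sentences, top_n=2):
    """Collect hyponym candidates per hypernym and keep the most
    frequent ones as seeds."""
    counts = {h: Counter() for h in HYPERNYMS}
    for sent in sentences:
        for hyper, hypo in re.findall(r"(\w+) such as (\w+)", sent):
            if hyper in HYPERNYMS:
                counts[hyper][hypo] += 1
    return {h: [w for w, _ in c.most_common(top_n)]
            for h, c in counts.items()}

corpus = ["Any fruit such as banana will do.",
          "We like fruit such as banana or apple.",
          "Lean meat such as beef is fine."]
seeds = extract_seeds(corpus)
```

Ranking candidates by frequency corresponds to the PAT-Topn configurations used later in the experiments.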
In order to extract seeds for Task II, we apply two domain-specific sets of patterns (patt_dish and patt_atom). We rank the food items according to the frequency with which they occur with the respective pattern set. Since food items may occur in both rankings, we merge the two rankings: the top end of the merged ranking represents dishes while the bottom end represents atoms.
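The text does not spell out the exact merging scheme; one plausible reading, sketched below with illustrative counts, scores each item by its dish-pattern frequency minus its atom-pattern frequency:

```python
from collections import Counter

# Hypothetical merge of the two frequency rankings; the score
# (dish-pattern count minus atom-pattern count) is one plausible
# reading of the paper's scheme, and the counts are made up.
freq_dish = Counter({"lasagna": 9, "stew": 6, "rice": 1})
freq_atom = Counter({"rice": 8, "salt": 5, "stew": 1})

items = set(freq_dish) | set(freq_atom)
merged = sorted(items,
                key=lambda x: freq_dish[x] - freq_atom[x],
                reverse=True)
# top of the merged ranking ~ dishes, bottom ~ atomic food items
```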

Using a General-Purpose Ontology
As a hard baseline, we also make use of the semantic relationships encoded in GermaNet. Our two types of food categorization schemes can be approximated with the hypernymy graph in that ontology: We manually identify nodes that resemble our food categories (e.g. fruit, meat or dish) and label any food item that is an immediate or a mediate hyponym of these nodes (e.g. apple for fruit) with the respective category label. The downside of this method is that a large number of food items are missing from the GermaNet database (§2.2).

Other Baselines & Post-Processing
In addition to the previous methods, we implement a heuristic baseline (HEUR) that rests on the observation that German food items of the same food category often share the same suffix, e.g. Schokoladenkuchen (English: chocolate cake) and Apfelkuchen (English: apple pie). For HEUR, we manually compiled a small set of typical suffixes for each food type/dish category (ranging from 3 to 8 suffixes per category). For classification, we assign a food item the category label whose suffix matches the food item. 2 We also examine an unsupervised baseline (UNSUP) that applies spectral clustering to the similarity graph following von Luxburg (2007):
• Input: a similarity matrix W and the number of categories to detect, k.
• The Laplacian L is constructed from W. It is the symmetric Laplacian L = I − D^{-1/2} W D^{-1/2}, where D is a diagonal degree matrix. 3
• A matrix U ∈ R^{n×k} is constructed that contains as columns the first k eigenvectors u_1, ..., u_k of L.
• The rows of U are interpreted as the new data points. The final clustering is obtained by k-means clustering of the rows of U.
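The spectral-clustering steps can be sketched as follows; the toy similarity matrix is illustrative, and a simple deterministic k-means stands in for whatever implementation the authors used:

```python
import numpy as np

# Sketch of the UNSUP baseline (spectral clustering, von Luxburg 2007)
# on a toy similarity matrix; the real input is the food graph.
def spectral_clustering(W, k=2, iters=20):
    d = W.sum(axis=1)
    D_is = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - D_is @ W @ D_is      # symmetric Laplacian
    _, eigvecs = np.linalg.eigh(L)            # ascending eigenvalues
    U = eigvecs[:, :k]                        # first k eigenvectors
    # deterministic farthest-point init for k-means on the rows of U
    # (written for k = 2; repeat the argmax step for larger k)
    centers = np.array([U[0], U[np.argmax(((U - U[0]) ** 2).sum(-1))]])
    for _ in range(iters):
        labels = ((U[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([U[labels == j].mean(axis=0) for j in range(k)])
    return labels

# two clearly separated blocks connected by a weak edge
W = np.array([[0, 5, 5, 0, 0, 0],
              [5, 0, 5, 0, 0, 0],
              [5, 5, 0, 1, 0, 0],
              [0, 0, 1, 0, 5, 5],
              [0, 0, 0, 5, 0, 5],
              [0, 0, 0, 5, 5, 0]], dtype=float)
labels = spectral_clustering(W, k=2)
```

On this matrix the two dense blocks are recovered as the two clusters; note that, unlike the semi-supervised method, nothing steers the clusters toward the intended category inventory.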
UNSUP (which is completely parameter-free) gives some indication about the intrinsic expressiveness of the similarity graph as it lacks any guidance towards the categories to be predicted.
In graph-based food categorization, one can only make predictions for food items that are connected (be it directly or indirectly) to seed food items within the similarity graph. To expand labels to unconnected food items, we apply some post-processing (POSTP). Similarly to HEUR, it exploits the suffix similarity of food items: each unconnected food item is assigned the label of the food item (among those labeled by the graph optimization) that shares the longest suffix with it. Due to their similar nature, we refrain from applying POSTP on HEUR as it would produce no changes.
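The POSTP step can be sketched as follows; the labeled items and the unconnected German-style compounds are illustrative:

```python
# Sketch of POSTP: an unconnected food item inherits the label of the
# already-labeled item sharing the longest suffix. Items are illustrative.
def shared_suffix_len(a, b):
    """Length of the longest common suffix of two strings."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def postprocess(unconnected, labeled):
    out = {}
    for item in unconnected:
        best = max(labeled, key=lambda l: shared_suffix_len(item, l))
        out[item] = labeled[best]
    return out

labeled = {"apfelkuchen": "SWEET", "tomatensaft": "BEVERAGE"}
result = postprocess(["schokoladenkuchen", "orangensaft"], labeled)
```

Here schokoladenkuchen inherits SWEET via the shared suffix -kuchen, and orangensaft inherits BEVERAGE via -saft, mirroring the Schokoladenkuchen/Apfelkuchen observation behind HEUR.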
2 Unlike German food items, English food items are often multi-word expressions. Therefore, we assume that for English, instead of analyzing suffixes the usage of the head of a multiword expression (i.e. chocolate cake) would be an appropriate basis for a similar heuristic.
3 That is, D_ii equals the sum of the ith row of W.

Experiments
We report precision, recall, F-score and accuracy. For precision, recall and F-score, we list the macro-averaged score. Table 5 compares different classifiers and configurations for the prediction of food types (against the gold standard from Table 1). Apart from the previously described baselines, we consider n manually selected prototypes (n-PROTO) and the top n food items produced by Hearst patterns (PAT-Topn) as seeds for graph-based optimization. The table shows that the semi-supervised graph-based approach with these seeds outperforms the baselines UNSUP and HEUR. As few as 5 prototypical seeds (per category) are required to obtain performance that is even better than using plain GermaNet. The table also shows that post-processing (with our suffix heuristics) consistently improves performance. Manually choosing prototypes is more effective than instantiating seeds via Hearst patterns. The quality of the output of Hearst patterns degrades from top 10 onwards. However, considering that PAT-Topn does not involve any manual intervention, it already produces decent results. Finally, even GermaNet can be effectively used as seeds.
Table 6: Comparison of different classifiers distinguishing between dishes and atomic food items (graph indicates graph-based optimization).

Dishes and atomic food items are heterogeneous classes, which is why more seeds are required for initialization. This means that we cannot look for prototypes. For simplicity, we resorted to randomly sampling seeds from our gold standard (RAND-n). For HEUR, we could not find a small and intuitive set of suffixes that are shared by many atomic food items; therefore, we considered all food items from our vocabulary whose suffix did not match a typical dish suffix as atomic. As this leaves no unspecified food items in our vocabulary, we cannot use the output of HEUR as seeds for graph-based optimization. In contrast to the previous experiment, HEUR is a more robust baseline. But again, post-processing mostly improves performance, and patterns are not as good as manual (random) seeds, yet the former are notably better than HEUR w.r.t. F-score. Unlike in the food-type classification, graph-based optimization applied on GermaNet does not result in any improvement. We assume that the precision of plain GermaNet, at 81.3%, is too low. Since GermaNet cannot effectively be used as seeds for the graph-based optimization and post-processing already has a strong positive effect, we may wonder how effective the actual graph-based optimization is for this classification task. After all, significantly more seeds are required for this classification task than for the previous one, so we need to show that it is not merely the seeds (+post-processing) that are required for a reasonable categorization. Table 7 examines two key configurations with and without graph-based optimization. It shows that also for this classification task, graph-based optimization produces a categorization superior to the mere seeds. Moreover, the suffix-based post-processing is complementary to the improvement by the graph-based optimization. Table 8 compares, for each food type, 5 manually selected prototypical seeds (i.e. 5-PROTO) and the 5 food items most frequently observed with patt_hearst (Table 4).
While the manually chosen seeds represent the spectrum of food items within each particular class (e.g. for STARCH, some type of pasta, rice and potato was chosen), it is not possible to enforce such diversity with the automatically extracted seeds. However, most food items are correct. Table 9 displays the 10 most highly ranked dishes and atomic food items extracted with patt dish and patt atom (Table 4). Unlike the previous task (Table 8), we obtain more heterogeneous seeds within the same class.

Distributional Similarity
Since many recent methods for related tasks, such as noun classification, are based on so-called distributional similarity (Riloff and Shepherd, 1997; Lin, 1998; Snow et al., 2004; Weeds et al., 2004; Yamada et al., 2009; Huang and Riloff, 2010; Lenci and Benotto, 2012), we also examine this as an alternative representation to the pattern-based similarity graph (Table 3). We represent each food item as a vector which itself is an aggregate of the contexts of all mentions of a particular food item. We weighted the individual (context) words co-occurring with the food item at a fixed window size of 5 words with tf-idf. We can now apply graph-based optimization on the similarity matrix encoding the cosine similarities between any possible pair of vectors representing two food items. As seeds, we use the best configuration (not employing GermaNet), i.e. 10-PROTO for food-type classification and RAND-100 for the dish classification. Since, however, the graph clustering is not actually necessary, as we have a full similarity matrix (rather than a sparse graph) that also allows us to compare any arbitrary pair of food items directly, we also employ a second classifier (for comparison) based on the nearest-neighbour principle. We assign each food item the label of the most similar seed food item. Table 10 compares these two classifiers with the best previous result.
Table 8: 5 manually chosen seeds (5-PROTO) vs. 5 Hearst-pattern seeds (PAT-Top5) per class.
MEAT: schnitzel, rissole, bologna, redfish, trout | salmon, beef, chicken, turkey hen, poultry
BEVERAGE: coffee, tea, water, beer, coke | coffee, beer, mineral water, lemonade, tea
VEGE: peas, green salad, tomato, cauliflower, carrot | zucchini, lamb's salad, broccoli, leek, cauliflower
SWEET: chocolate, torte, popcorn, apple pie, potato crisps | wine gum, marzipan, custard, pancake, biscuits
SPICE: pepper, cinnamon, salt, gravy, remoulade | cinnamon, laurel, clove, tomato
It shows that the pattern-based representation consistently outperforms the distributional representation. The former may be sparse but it produces high-precision similarity links. 6 The vector representation, on the other hand, may not be sparse but it contains a high degree of noise. The major problem is that not only are vectors of similar food items, such as chips (fries), potatoes and rice, similar to each other, but so are vectors of different food items that are typically consumed with each other (e.g. fish and chips). This is because of their frequent co-occurrence (as in collocations like fish & chips). Unfortunately, these pairs belong to different food types. For the dish classification, however, the vector representation is less of a problem. 7 The distributional representation works better with the simple nearest-neighbour classifier. We assume that graph-based optimization adds further noise to the classification since, unlike the nearest-neighbour classifier, which only calculates the direct similarity between two vectors, it also incorporates indirect relationships (which may be more error-prone than the direct relationships) between food items.
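A minimal sketch of the nearest-neighbour variant of the distributional classifier (toy context counts; tf-idf weighting omitted for brevity):

```python
import math
from collections import Counter

# Food items as context-word vectors compared by cosine similarity,
# labels assigned by the nearest seed. Contexts are illustrative.
contexts = {
    "potatoes": Counter({"boil": 3, "peel": 2, "side": 1}),
    "rice":     Counter({"boil": 3, "side": 2}),
    "coke":     Counter({"drink": 4, "glass": 2}),
    "water":    Counter({"drink": 3, "glass": 1, "boil": 1}),
}

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

seeds = {"potatoes": "STARCH", "coke": "BEVERAGE"}

def nearest_seed_label(item):
    best = max(seeds, key=lambda s: cosine(contexts[item], contexts[s]))
    return seeds[best]
```

The noise problem described in the text shows up exactly here: context vectors of items that are frequently consumed together (fish, chips) also end up with high cosine similarity.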

Do we need a domain-specific corpus?
In this section, we want to provide evidence that, apart from the similarity graph and seeds, the textual source for the graph, i.e. our domain-specific corpus (chefkoch.de), is also important. For that purpose, we compare our current corpus against an open-domain corpus. We consider the German version of Wikipedia since this resource also contains encyclopedic knowledge about food items. Table 11 compares the graph-based induction on both corpora. As in the previous section, we only consider the best previous configuration. The table clearly shows that our domain-specific text corpus is a more effective resource for our purpose than Wikipedia.
6 By the label propagation within the graph-based optimization, the sparsity problem is also mitigated.
7 Fish and chips are both atoms, so in the dish classification, it is no mistake to consider them similar food items.

Evaluation for Relation Extraction
We now examine whether automatic food categorization can be harnessed for relation extraction. The task is to detect instances of the relation types SuitsTo, SubstitutedBy and IngredientOf introduced by Wiegand et al. (2012b) (repeated in Table 12) and motivated in Wiegand et al. (2012a). These relation types are highly relevant for customer advice/product recommendation. In particular, SuitsTo and SubstitutedBy are fairly domain-independent relation types. Customers want to know which items can be used together (SuitsTo), be it two food items that can be served as a meal or two fashion items that can be worn together. Substitutes are also relevant for situations in which item A is out of stock but item B can be offered as an alternative. Therefore, insights from this work should carry over to other domains. We randomly extracted 1500 sentences from our text corpus (§2.1) in which (at least) two food items co-occur. Each food pair mention was manually assigned one label. In addition to the three relation types from above, we introduce the label Other for cases in which either another relation between the target food items is expressed or the co-occurrence is coincidental. On a subset of 200 sentences, we measured a substantial inter-annotator agreement of Cohen's κ = 0.67 (Landis and Koch, 1977).
We train a supervised classifier and incorporate the knowledge induced from our domain-specific corpus as features. We chose Support Vector Machines with 5-fold cross-validation using the multi-class version of SVM-light (Joachims, 1999). Table 13 displays all features that we examine for supervised classification. Most features are widely used throughout different NLP tasks. One special feature, brown, takes into consideration the output of Brown clustering (Brown et al., 1992), which, like our graph-based optimization, produces a corpus-driven categorization of words. Similar to UNSUP, this method is unsupervised, but it considers the entire vocabulary of our text corpus rather than only food items. Therefore, this information can be considered a generalization over all contextual words. Such type of information has been shown to be useful for named-entity recognition (Turian et al., 2010) and relation extraction (Plank and Moschitti, 2013).

Why should food categories be helpful for relation extraction?
All relation types we consider comprise pairs of two food items, which makes these relation types likely to be confused. Contextual information may be used for disambiguation, but there may also be frequent contexts that are not sufficiently informative. For example, 25% of the instances of IngredientOf follow the lexical pattern food item1 with food item2, as in (1). However, the same pattern also covers 15% of the instances of SuitsTo, as in (2).
(1) We had a stew with red lentils. (Relation: IngredientOf)
(2) We had salmon with broccoli. (Relation: SuitsTo)
The food-type information we learned from our text corpus might tell us which of the food items are dishes. Only in (1) is there a dish, i.e. stew. So, one may infer that the presence of dishes is indicative of IngredientOf rather than SuitsTo.
food item1 and food item2 is another ambiguous context. It can not only be observed with the relation SuitsTo, as in (3) (66% of all instantiations of that pattern), but also with SubstitutedBy (20% of all mentions of that relation match that pattern), as in (4). For SuitsTo, two food items that belong to two different classes (e.g. MEAT and STARCH or MEAT and VEGE) are quite characteristic. For SubstitutedBy, the two food items are very often of the same category of the Food Guide Pyramid. Since the second ambiguous context involves the two general relation types SuitsTo and SubstitutedBy, resolving this ambiguity with automatically induced type information has some significance for other domains. In particular, for other life-style domains, domain-specific type information could be obtained following our method from §3.1. The disambiguation rule that two entities of the same type imply SubstitutedBy while otherwise they imply SuitsTo should also be widely applicable.
Table 13 (excerpt): features for supervised classification.
pos: part-of-speech sequence between the target food items and tags of the words immediately preceding and following them
synt: path in the syntactic parse tree from the first target food item to the second target food item
conj: conjunctive features: patt with brown classes of the target food items; pos sequence with brown classes of the target food items; synt with brown classes of the target food items
graph: semantic food information induced by graph optimization (config.: 10-PROTO(+POSTP) and RAND-100(+POSTP))
germanet: semantic food information derived from (plain) GermaNet
(Each pattern is designed for a particular relation type, so one can read off from the matching pattern which class is predicted.) word is slightly better but, unlike patt, it is dependent on supervised learning.
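The disambiguation rules above can be sketched as follows; the category assignments and pattern names are illustrative simplifications, not the paper's supervised feature-based classifier:

```python
# Sketch of how induced food categories could disambiguate the relation
# types discussed above; the rules and category inventory are
# illustrative, not the paper's SVM-based classifier.
FOOD_TYPE = {"stew": "DISH", "red lentils": "VEGE",
             "salmon": "MEAT", "broccoli": "VEGE",
             "butter": "FAT", "margarine": "FAT"}

def guess_relation(item1, item2, pattern):
    t1, t2 = FOOD_TYPE[item1], FOOD_TYPE[item2]
    if pattern == "with":
        # a dish argument suggests IngredientOf rather than SuitsTo
        return "IngredientOf" if "DISH" in (t1, t2) else "SuitsTo"
    if pattern == "and":
        # same food type suggests substitutes, different types SuitsTo
        return "SubstitutedBy" if t1 == t2 else "SuitsTo"
    return "Other"
```

On the running examples this recovers IngredientOf for "stew with red lentils" and SuitsTo for "salmon with broccoli"; in the actual system the categories enter as features of a supervised classifier rather than as hard rules.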

Results
The only feature that individually manages to significantly outperform word is graph. The traditional features (i.e. pos, synt and brown) only produce a mild improvement when added jointly to word along with some conjunctive features. When graph is added to this feature set (i.e. word+patt+pos+synt+brown+conj), we obtain another significant improvement. In conclusion, the information we induced from our domain-specific corpus cannot be obtained by other NLP features, including other state-of-the-art induction methods such as Brown clustering.

Related Work
Many of the previous works on noun categorization also address the task of hypernym classification (Hearst, 1992; Caraballo, 1999; Widdows, 2003; Kozareva et al., 2008; Huang and Riloff, 2010; Lenci and Benotto, 2012), and some include examples involving food items (Widdows and Dorow, 2002; Cederberg and Widdows, 2003). The task of data-driven lexicon expansion has also been explored before (Kanayama and Nasukawa, 2006; Das and Smith, 2012); however, our paper presents the first attempt to carry out a comprehensive categorization for the food domain. For the first time, we also show that type information can effectively improve the extraction of very common relations. For the Twitter domain, the usage of type information based on clustering has already been found effective for supervised learning (Bergsma et al., 2013).

Conclusion
We presented an induction method to assign semantic information to food items. We considered two types of categorization: food-type information and information about whether a food item is composite or not. The categorization is induced by graph-based optimization applied to a large unlabeled domain-specific text corpus. We produce categorizations that outperform a manually compiled resource. The usage of such a domain-specific corpus with a pattern-based representation is vital and largely outperforms other text corpora or a distributional representation. The induced knowledge improves relation extraction.