Focused Web Corpus Crawling

In web corpus construction, crawling is a necessary step, and it is probably the most costly of all, because it requires expensive bandwidth usage, and excess crawling increases storage requirements. Excess crawling results from the fact that the web contains a lot of redundant content (duplicates and near-duplicates), as well as other material not suitable or desirable for inclusion in web corpora or web indexes (for example, pages with little text or virtually no text at all). An optimized crawler for web corpus construction would ideally avoid crawling such content in the first place, saving bandwidth, storage, and post-processing costs. In this paper, we show in three experiments that two simple scores are suitable to improve the ratio between corpus size and crawling effort for web corpus construction. The first score is related to overall text quality of the page containing the link, the other one is related to the likelihood that the local block enclosing a link is boilerplate.


Crawl Optimization and Yield Ratios
Optimizing a crawling strategy consists in maximizing its weighted coverage WC(t) at any time t during a crawl (Olston and Najork, 2010, 29), i. e., the summed weight of the documents downloaded until t, where the weight of each crawled document is calculated as a measure of the usefulness of the document relative to the purpose of the crawl. To maximize WC, it is vital to guess the weight of the documents behind harvested links before download, such that documents with potentially lesser weight have a lower probability of being downloaded. So-called focused crawlers (in a broad sense) are designed to maximize WC with respect to some specific definition of document weight, for example when documents with a high search-engine relevance (measured as their PageRank or a similar score), documents about specific subjects, or documents in a specific language are desired (Chakrabarti et al., 1999; Menczer et al., 2004; Baykan et al., 2008; Safran et al., 2012). For our purpose, i. e., web corpus crawling, a document with a high weight can simply be defined as one which is not removed from the corpus by the post-processing tools due to low linguistic quality and/or one which contributes a large amount of text to the corpus. Recently, an interesting approach to crawl optimization along such lines was suggested which relies on statistics about the corpus yield from known hosts (Suchomel and Pomikálek, 2012). Under this approach, the weight (of a whole web host rather than of a single document) is taken to be the ratio of good documents from the host remaining in the corpus after a specific post-processing chain has been applied to the documents. Harvested URLs pointing to certain hosts are prioritized accordingly. We follow a route similar to that of Suchomel and Pomikálek, but look at document-local features instead of host statistics.
Throughout this paper, we refer to the yield ratio instead of WC, although they are related notions. We define the yield ratio $Y_d$ for a set $D_c$ of crawled unprocessed documents and a set $D_r$ of retained documents after filtering and processing for inclusion in a corpus, with $D_r \subset D_c$, as:

$$Y_d = \frac{|D_r|}{|D_c|}$$

For example, a document yield ratio $Y_d = 0.21$ means that 21% of the crawled documents survived the cleaning procedure (i. e., were not classified as duplicates or spam, were long enough, written in the target language, etc.) and ended up in the corpus. In order to maximize $Y_d$, 79% of the documents should not have been downloaded in the first place in this example. A parallel definition is assumed for $Y_b$ for the respective amounts of bytes. The document yield ratio is easier to interpret because the byte yield ratio depends on the amount of markup which has to be stripped, and which might vary independently of the quality of the downloaded web pages. Obviously, the yield ratio, like the weighted coverage, depends highly on the definition of what a good document is, i. e., what the goal of the crawl is. We assume, similar to Suchomel and Pomikálek's approach, that our tools reliably single out the documents that are suitable for inclusion in a corpus, and that calculating a yield ratio based on the output of those tools is therefore reasonable.
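The two yield ratios can be computed from four counters collected during a crawl. The following is a minimal sketch; the class and function names (`CrawlStats`, `document_yield`, `byte_yield`) are our own illustrative choices and are not part of texrex or Heritrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlStats:
    """Counters for one crawl (hypothetical container, for illustration only)."""
    crawled_docs: int    # |D_c|: documents downloaded
    retained_docs: int   # |D_r|: documents that survived cleaning
    crawled_bytes: int   # raw bytes downloaded
    retained_bytes: int  # bytes that ended up in the corpus


def _ratio(retained: int, crawled: int) -> float:
    """Return retained / crawled, with 0.0 for an empty crawl."""
    if retained > crawled:
        raise ValueError("retained amount cannot exceed crawled amount")
    return retained / crawled if crawled > 0 else 0.0


def document_yield(stats: CrawlStats) -> float:
    """Document yield ratio Y_d = |D_r| / |D_c|."""
    return _ratio(stats.retained_docs, stats.crawled_docs)


def byte_yield(stats: CrawlStats) -> float:
    """Byte yield ratio Y_b, defined in parallel to Y_d."""
    return _ratio(stats.retained_bytes, stats.crawled_bytes)
```

With 100 crawled and 21 retained documents, `document_yield` returns 0.21, matching the example in the text.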

Experiment 1: Seed and Crawl Quality
In this experiment, we examine the correlation between the yield ratio of crawler seed URLs and the yield ratio of short Breadth-First Search (BFS) crawls based on those URLs. We used the Heritrix (1.14) web crawler (Mohr et al., 2004) and an older version of the texrex web page cleaning toolkit (Schäfer and Bildhauer, 2012). The tools perform, among other things, boilerplate detection and text quality evaluation in the form of the so-called Badness score. A document receives a low Badness score if the most frequent function words of the target language have a high enough frequency in the document. The Badness score is based on previous ideas from language identification and web document filtering (Grefenstette, 1995; Baroni et al., 2009).
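To illustrate the general idea of a function-word-based quality score, the following sketch penalizes a document for every frequent function word that occurs less often than expected. This is an assumption-laden toy, not the texrex formula: the function name `badness`, the penalty definition, the resulting scale, and the expected frequencies in the usage note are all hypothetical.

```python
import re


def badness(text: str, expected_freqs: dict) -> float:
    """Toy Badness-like score: lower is better.

    expected_freqs maps a function word to its expected relative frequency
    in the target language (hypothetical values, supplied by the caller).
    """
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return float("inf")  # no text at all: worst possible score
    total = len(tokens)
    score = 0.0
    for word, expected in expected_freqs.items():
        observed = tokens.count(word) / total
        if observed < expected:
            # Relative shortfall in [0, 1]; only under-representation is penalized.
            score += (expected - observed) / expected
    return score
```

Given a profile such as `{"de": 0.05, "het": 0.03, "een": 0.03}` (made-up numbers), a text in the target language that contains these words with sufficient frequency scores lower than a text in another language, which is the behavior the paragraph above describes.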
Originally, this experiment was carried out in the context of an evaluation of different sources of seed URLs for crawls. In a preliminary step, we began by collecting seed URLs from various sources:
1. the DMOZ directory
2. the Etools meta search engine
3. the FriendFeed social service aggregator
4. the identi.ca social bookmarking service
5. Wikipedia dumps
We scraped the content behind the URLs and ran a state-of-the-art language identifier (Lui and Baldwin, 2012) on it in order to obtain language-classified seed URLs (Barbaresi, 2013). We then looked specifically at languages that are the single dominant language of at least one top-level domain (TLD). We randomly sampled 1,000 seed URLs for each of the 20 combinations of seed sources and languages/TLDs, downloaded them, and used texrex to determine the document yield ratio for the documents behind the 1,000 seeds. The software was configured to perform boilerplate removal, removal of documents based on high Badness scores, perfect duplicate removal, and deletion of documents shorter than 1,000 characters (after boilerplate removal). Then, we crawled the respective TLDs, starting each crawl with the corresponding 1,000 seed URLs. In each crawl, we downloaded 2 GB of raw data, cleaned them, and calculated the document yield ratio using the same configuration of texrex as we used for cleaning the seed documents. Figure 1 plots the data and an appropriate linear model. We see that there is a strong correlation (adjusted R² = 0.7831) between the yield ratio of the documents behind the seed URLs and the yield ratio of the documents found by using the seeds for BFS crawling. It follows that giving high priority to links from pages which are themselves considered high-quality documents by the post-processing tools will likely lead to more efficient crawling.
Since there is no fundamental distinction between initial URL seeds and URLs harvested at a later time during the crawl, this effect is likely to extend to the whole run time of a crawl.
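For readers who want to reproduce this kind of analysis on their own crawls, a simple linear model and its adjusted R² can be computed as sketched below. The function name `fit_line` is our own, and the sketch assumes one predictor (the seed yield ratio) and one response (the crawl yield ratio); it does not reproduce the data behind Figure 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared, adjusted_r_squared).
    """
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("x and y must both vary")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 - ss_res / ss_tot
    # Adjustment for one predictor: (n - 1) / (n - p - 1) with p = 1.
    adjusted = 1.0 - (1.0 - r_squared) * (n - 1) / (n - 2)
    return slope, intercept, r_squared, adjusted
```

Here `xs` would hold the yield ratios of the seed samples and `ys` the yield ratios of the corresponding crawls, one pair per combination of seed source and language/TLD.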

Experiment 2: Crawling with Cyclic URL Selection
Using the same configuration of tools as in Section 2, we performed a crawl targeting Flemish documents in the Belgian .be national TLD, which hosts both Flemish and French documents in substantial proportions. Usually, even under more favorable conditions (i. e., when we crawl a TLD which contains mostly documents in the target language), the yield ratio of a BFS crawl decreases rapidly in the initial phase and then stays at a low level (Schäfer and Bildhauer, 2013, p. 31). Figure 2 illustrates this with an analysis of a .de BFS crawl from late 2011, also processed with the same tools as mentioned in Section 2. Notice that the .de domain hosts German documents almost exclusively.
The interesting complication in this experiment is thus the non-target language present in the TLD scope of the crawler and the related question whether, simply speaking, predominantly Flemish documents link to other predominantly Flemish documents rather than French documents. Since the Badness score (calculated as described in Section 2) includes a form of language identification, the yield ratio takes into account this additional complication.
We tested whether the decline of the yield ratio could be compensated for by selecting "high quality" URLs in the following manner: The crawl progressed in five phases. In the first short burn-in phase, we crawled 1,000,000 documents, and in each of the second to fifth phases, we crawled 10,000,000 documents. After each phase, the crawl was halted, the crawler frontier was emptied, and the crawl was then re-started with a selection of the URLs harvested in the previous phase. Only those URLs were used which came from documents with a Badness score of 10 or lower (= documents in which the distribution of the most frequent function words fits the expected distribution for Flemish very well, cf. Section 2), and from text blocks with a boilerplate score (Schäfer and Bildhauer, 2012) in [0.5, 1] (= likely not boilerplate). Additionally, we made sure that no URLs were re-used between the five phases. The very promising results are plotted in Figure 3.

Table 1: Fit of linear models for the decrease in the yield ratios of the first 100 snapshots in each of the five phases of the .be crawl. For the first phase, only 50 snapshots were crawled and fitted.
The decline of the yield ratio is almost linear for the first 100 snapshots in the five phases (cf. Table 1), where each phase has roughly 500 snapshots in total, and one snapshot corresponds to 400 MB of downloaded raw data. After this decline, the yield ratio remains at low levels around 0.05. Cyclic URL selection, however, repeatedly manages to push the yield ratio to above 0.2 for a short period. The subsequent sharp decline shows that link selection/prioritization should rather be implemented in the crawler frontier management in order to achieve a constant effect over longer crawls (cf. Section 5).
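The selection step between two phases can be sketched as a simple filter over the harvested links. The thresholds (Badness of 10 or lower, boilerplate score in [0.5, 1]) are taken from the description above; the record type `HarvestedLink`, the function `select_seeds`, and their field names are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarvestedLink:
    """A link together with the scores of its context (hypothetical record)."""
    url: str
    source_badness: float    # Badness score of the linking document
    block_score: float       # boilerplate score of the enclosing text block


def select_seeds(links, used_urls, max_badness=10.0, min_block_score=0.5):
    """Return URLs to restart the crawl with, never re-using a URL.

    used_urls is a set shared across phases; it is updated in place.
    """
    selected = []
    for link in links:
        if link.url in used_urls:
            continue  # already used in an earlier phase (or earlier in this one)
        if link.source_badness > max_badness:
            continue  # linking document does not look like good target-language text
        if not (min_block_score <= link.block_score <= 1.0):
            continue  # enclosing block is likely boilerplate
        used_urls.add(link.url)
        selected.append(link.url)
    return selected
```

Calling `select_seeds` once per phase with the same `used_urls` set enforces the constraint that no URL is re-used between phases.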

Experiment 3: Internal Crawl Analysis
For the last experiment, we used the most recent version of the texrex toolkit, which writes full link structures for the processed documents as a by-product. 3 We performed an internal analysis of a small portion of a crawled data set from the German TLD; this data set is part of the raw material of the DECOW corpus (Schäfer and Bildhauer, 2012). The data set contains 11,557,695 crawled HTML documents and 81,255,876 http links extracted from the crawled documents (only <a> tags). Among the link URLs in the sample, 711,092 are actually links to documents in the sample, so we could analyze exactly those 711,092 links. It should be noted that we only looked at links to different hosts, such that host-internal links (navigation to "Home", etc.) are not included in the analysis.
In this experiment, we were interested specifically in the many documents which we usually discard right away simply because they are either very short (below 2 KB of unstripped HTML) or perfect duplicates of other documents. This is a [...]

3 The new version (release name hyperhyper) has been released and documented at http://texrex.sf.net/.

Figure 4: Badness scores of the links in the crawl analysis described in Section 4. The x axis shows the Badness scores of the documents which linked to the retained ("good") and the deleted ("bad") documents. The y axis shows the proportion of retained/deleted documents for which the Badness score is ≥ x. (Lower Badness scores are better.)

The observable correlation between the quality of a link's context and the quality of the page behind the link is stronger for the boilerplate score than for the Badness score. For example, had we only followed links from documents with a Badness score of 10 or lower (= better), then [...]

Tables 2 and 3 show confusion matrices for a reasonable Badness threshold (10) and a reasonable boilerplate threshold (0.5). Obviously, if we use Badness and boilerplate scores of the link context to make a binary download decision, the accuracy is much too low, which is why we suggest merely prioritizing URLs instead of discarding them, cf. Section 5.
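A confusion matrix of this kind can be derived from pairs of a link-context score and the fate of the linked document. The sketch below is only meant to make the evaluation explicit; the function names and the input format are our assumptions, and it does not reproduce the figures in Tables 2 and 3.

```python
def confusion_matrix(links, follow):
    """Count outcomes of a binary download decision.

    links:  iterable of (score, retained) pairs, where score describes the
            link context and retained says whether the linked document
            survived cleaning.
    follow: predicate on the score; True means "download the target".
    """
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for score, retained in links:
        if follow(score):
            counts["tp" if retained else "fp"] += 1
        else:
            counts["fn" if retained else "tn"] += 1
    return counts


def accuracy(counts):
    """Share of correct decisions among all decisions."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0
```

For the Badness threshold, the predicate would be `lambda badness: badness <= 10`; for the boilerplate threshold, `lambda score: score >= 0.5`.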

Conclusion and Planned Crawler Architecture
We have shown that two standard cleaning algorithms used in web corpus construction, i. e., text quality evaluation based on frequent short words and boilerplate detection (as implemented in the texrex toolkit), have a high potential for optimizing web corpus crawling through the prioritization of harvested URLs in a crawler system.
We are now in the process of designing a custom web corpus crawler system called HeidiX, which integrates the texrex post-processing tools for weight estimation based on the methods described in this paper. Figure 6 schematically shows the current design draft. HeidiX is designed with a system of ranked URL back queues for harvested links (cf. UrlQueues). Each queue holds URLs for which the weight estimation is within a specifiable interval, such that the most promising URLs are in one queue, etc. The actual downloading is performed by massively parallel fetcher threads in the FetcherPool, which (in the final software) will talk to a DNS cacher and a politeness manager, which handles caching of Robots Exclusion Information and politeness intervals. The fetcher threads pop URLs from one of the ranked queues, which is selected randomly with prior probabilities inversely proportional to the rank of the queue. Thus, promising URLs are popped more often and less promising ones less often.
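The queue selection just described can be sketched as follows. This is not HeidiX code: the class `RankedQueues`, its methods, and the choice to skip empty queues and renormalize over the remaining ones are our assumptions for the sake of a runnable example.

```python
import random
from collections import deque


class RankedQueues:
    """Ranked URL queues; rank 1 holds the most promising URLs (illustrative)."""

    def __init__(self, lower_bounds, rng=None):
        # lower_bounds: descending lower limits of the weight interval of
        # each queue, e.g. [0.75, 0.5, 0.25, 0.0] for four queues.
        self.lower_bounds = list(lower_bounds)
        self.queues = [deque() for _ in self.lower_bounds]
        self.rng = rng or random.Random()

    def push(self, url, weight):
        """Put url into the first queue whose lower bound it reaches."""
        for index, lower in enumerate(self.lower_bounds):
            if weight >= lower:
                self.queues[index].append(url)
                return
        self.queues[-1].append(url)  # below all bounds: least promising queue

    def selection_probabilities(self):
        """Map queue index to its probability of being chosen next.

        Probabilities are proportional to 1 / rank over non-empty queues.
        """
        weights = {i: 1.0 / (i + 1) for i, q in enumerate(self.queues) if q}
        total = sum(weights.values())
        return {i: w / total for i, w in weights.items()}

    def pop(self):
        """Pop a URL from a randomly selected non-empty queue, or None."""
        probs = self.selection_probabilities()
        if not probs:
            return None
        indices = list(probs)
        chosen = self.rng.choices(indices, weights=[probs[i] for i in indices], k=1)[0]
        return self.queues[chosen].popleft()
```

With two non-empty queues of rank 1 and 2, the first is chosen with probability 2/3 and the second with probability 1/3, so less promising URLs are delayed rather than discarded.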
For guessing the weight, pluggable modules can be used and combined in the FocusedWalker container. Currently, we have the standard UrlSeenFilter, which is based on our own self-scaling Bloom Filter implementation (Bloom, 1970; Almeida et al., 2007), and which prevents any URL from being queued more than once. We have plans for a URL-based language guesser (Baykan et al., 2008) in the form of the LanguagePredictor, and a prioritizer based on the yield from specific hosts as described in Suchomel and Pomikálek (2012) in the form of the HostYieldPrioritizer, which reads statistics directly from the texrex module. The texrex module extracts all hyperlinks from processed documents and tags them with the quality scores described in this paper, such that the QualityPrioritizer module can adjust the expected weight of the document behind each URL.
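The principle behind a Bloom-filter-based URL-seen test can be sketched in a few lines. Note the simplifications: the sketch uses a fixed-size bit array rather than a self-scaling one, and the class body, its parameters, and the hashing scheme (double hashing over a SHA-256 digest) are our assumptions, not the HeidiX implementation.

```python
import hashlib


class UrlSeenFilter:
    """Fixed-size Bloom filter for URLs (illustrative, not self-scaling)."""

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url):
        # Derive num_hashes bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def check_and_add(self, url):
        """Return True if url was (possibly) seen before; record it either way.

        False positives are possible, false negatives are not.
        """
        seen = True
        for pos in self._positions(url):
            byte_index, bit_index = divmod(pos, 8)
            mask = 1 << bit_index
            if not self.bits[byte_index] & mask:
                seen = False
                self.bits[byte_index] |= mask
        return seen
```

A URL is queued only if `check_and_add` returns False. The price of the compact representation is that a small share of unseen URLs is wrongly reported as seen and therefore never queued.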
The HeidiX architecture also features an alternative queueing strategy in the form of the RandomWalker, which allows users to obtain uniform random samples from the web based on existing algorithms (Henzinger et al., 2000; Rusmevichientong et al., 2001). Since obtaining such samples is a goal which is mostly orthogonal to the one assumed in this paper, we do not discuss this further here. Finally, a SnapshotKeeper module allows users to halt and continue crawls by writing/reading the current state of the relevant components to/from disk.
We hope that HeidiX will become a valuable tool in both the efficient construction of very large web corpora (FocusedWalker) and the construction of smaller unbiased reference samples as well as web analysis (RandomWalker).