Mining Legacy Issues in Open Pit Mining Sites: Innovation & Support of Renaturalization and Land Utilization

and actively learn their representation in our dataset. We evaluate the OCR, Active Learning, and Text Classification components separately to report the performance of the system. The Active Learning and text classification results are twofold: whereas our categories about restrictions work sufficiently well (> .85 F1), the seven topic-oriented categories were complicated for human coders, and hence the results achieved only mediocre evaluation scores (< .70 F1).


Introduction
In many parts of the world, raw materials were mined in open pit mines during the last century, leaving many of these regions inhospitable or uninhabitable. To put these regions back into use, entire stretches of land must be renaturalized. Sustainable concepts include agricultural use, recreational areas for the inhabitants of these regions, industrial parks, solar and wind farms, or building land. For the sustainable subsequent use or transfer to a new primary use, a large amount of information on contaminated sites and soils has to be managed. Not all areas are suitable: usage of some must be prohibited, while others can be used immediately.
In our paper, we refer to a real-world application of post-mining management of former lignite open pit mines in the eastern part of Germany. As a rule, land users must inquire with an authority whether restrictions, utilization concepts, and foundations for planning and building processes for these areas are documented in existing records. This process is lengthy and delays the subsequent use of these areas. We address this issue by demonstrating and evaluating a workflow consisting of Optical Character Recognition (OCR), Text Classification (TC), Active Learning (AL), and an integration with a Geographic Information System (GIS) to provide a dynamic and fast information situation. Efficient information and documentation systems support the renaturation efforts, a faster transfer to the population, and the recreation of a healthy environment, which fulfills the OECD demand for several SDGs. These goals can be achieved much more precisely with a GIS-based improvement of land monitoring Avtar et al. [2020]. Moreover, it makes the use of those landscapes safer for all interest groups and stakeholders. Which objectives are favored depends on the respective GIS application. In the context of our application, the German lignite mining landscape, the goals Zero Hunger (2), Clean Water and Sanitation (6), Sustainable Cities and Communities (11), Climate Action (13), Life Below Water (14), and Life on Land (15) are notably relevant. 1 The SDGs in Germany are implemented in a decentralized manner, based on the federal structure of the Federal Republic Government [2016]. Accordingly, the implementation of the SDGs is the responsibility of the federal government as well as the states, which coordinate with each other. It is then passed on to the municipalities, which set up local implementation strategies, in accordance with the principle of subsidiarity.
Currently, information about former lignite open pit mines exists in independent GIS, related unstructured documents are stored in dedicated storage systems (either as piles of paper or scanned, but usually without further text extraction), and the connections between the documents and the area information are stored in additional databases. In order to respond to requests for, e.g., no-use or hazard information, all information must be contextualized, compiled, and manually evaluated. In addition, administrations often work with isolated solutions, making the interoperability and efficient use of spatial data difficult and challenging Zern-Breuer et al. [2020], especially for small municipalities. Our proposed workflow greatly improves the information situation and the usability of the information, and the GIS implementations can evolve into efficient and complete information systems or web services for post-mining regions in Germany. This will accelerate the after-use and address the shortage of resources and employees in the organizations entrusted with managing these areas, ultimately supporting municipalities in implementing the concerned SDGs.

Foundations and Related Work
Sustainability issues have long been politically ignored, but have become much more relevant in recent years. As a result, this societal challenge has started to gain traction in computer science Gomes et al. [2019], and also in natural language processing in the recent past Conforti et al. [2020]. To increase the success of sustainability projects, Conforti et al. [2020] classify user-perceived values on unstructured interview text to gather structured data about people's subjective values. It is used in developing countries to ensure that projects targeted at one or more SDG goals will be continued by the community after their initial implementation. This work also performs sentence classification for achieving sustainable goals; however, besides operating on a completely different data domain, we treat the task as two multi-label classifications, use more recent transformer-based models, and integrate additional geo-spatial information. Pincet et al. [2019] describe the development and evaluation of an official OECD API to support the automatic classification of SDGs in reporting documents. This clearly indicates a demand for such tools; however, the classification of SDGs addressed in the text is not our main focus.

There is a variety of OCR engines available, with Tesseract Smith [1987] being a good starting point. The tool itself offers a number of additional pre-processing mechanisms for document images. However, it does not implement the full range of the state of the art in OCR. Pre-processing of the images, as proposed and implemented by the OCR-D project Binmakhashen and Mahmoud [2019], Neudecker et al. [2019], is required to additionally extend the tool with the latest developments in OCR processing. However, which pre-processing steps and which methods lead to the best results is strongly data-dependent and is evaluated and documented in Section 5.
In recent years, TC, like many other fields, has experienced a paradigm shift towards transformer-based models Vaswani et al. [2017], Devlin et al. [2019], which raised the state-of-the-art results on many tasks. Beyond the impressive performance gains, the practical importance of pre-trained models lies in the fact that their performance translates to low-data scenarios, which was previously impossible because deep models overfit on small data.

Geographic information systems are a common technological choice to visualize spatial data on cartographic maps. A GIS can combine a database storing textual information with the spatial data to support experts in the decision-making process or to enable the exploration of data. A recent toolbox of the commercial ArcGIS software, called LocateXT (ESRI), can connect a large number of unstructured datasets to a running GIS. However, although it can automatically link information and coordinates from the data, it does not support the extraction and processing of unstructured information and other attributes providing further information. Hence, we provide this capability with our OCR and TC processes.

Data
We use data from the Lausitzer und Mitteldeutsche Bergbauverwaltungsgesellschaft mbH (LMBV) 2 , which is responsible for the management and renaturation of abandoned mining sites. Moreover, it is obligated to provide reliable information about the managed lands to the public on request. Such requests require, among other things, information about any restrictions for the specific area, which can be found in the associated documents.
For our research, the LMBV provided us with 31,605 such documents (16,883 for the region Lausitz 3 , 14,722 for the region Mitteldeutschland 4 ). The documents are of different ages, reaching back to the 1960s. However, the scans were produced within the last 20 years. Depending on the age of the document and the time the scan was produced, the quality of the scans varies from excellent to fair. Additionally, some documents are stored in other digital formats (.doc, .docx, .odf) and are therefore not scanned. Besides the documents, our dataset contains over 30,000 geographic features. These features are described as points, lines, polygons, and multipolygons, and can be visualized in a GIS. In addition, these data come with additional non-spatial information, such as the geographical affiliation of the documents mentioned.

OCR
To detect the text in our digitized dataset, we utilized an OCR process chain based on Tesseract 5 and best practices from the OCR-D community Smith [1987], Neudecker et al. [2019], Binmakhashen and Mahmoud [2019] for documents where the text was not accessible in the digital copy. For this process, the major challenges are: (i) The vast number of documents makes it infeasible to fine-tune the OCR for each document; it has to be optimized over the whole collection. (ii) There is no manually transcribed evaluation data for the OCR process. (iii) The documents are written by humans without any review process, making erroneous words or grammar very likely. Since OCR strongly depends on the heterogeneous document formats in different parts of the world, optimizing for these challenges is not the focus of this work. Nevertheless, we applied best practices for OCR processing of German-language documents. We use the built-in Tesseract evaluation procedure to judge the overall quality of the process and apply further data filtering to cope with difficult documents and bad OCR results. OCR pre-processing steps for the images included orientation analysis and rotation, resizing of the image (400 dpi), denoising, lighting intensity correction, binarization, and deskewing. Lighting intensity correction improved the result in some cases but worsened it in others; it is therefore only applied if it improves the result based on the confidence score from Tesseract, as explained below. Denoising converts the images into grayscale and applies a dilation filter, an erosion filter, and, as a last step, a median blur filter. The quality of the results is measured by evaluating the confidence score produced by Tesseract, which provides a word-level confidence score reflecting the OCR quality. We aggregated the word-level confidence scores to a page-level confidence score by averaging over all recognized words. Pages without recognized text are scored with 0. Again, the document quality and layout are very heterogeneous, and annotating a test set for the OCR process would lack completeness. We identified 45,141 (Mitteldeutschland) and 35,256 (Lausitz) pages in the document dataset. We accept all pages for our experimental dataset which are evaluated with a confidence score of more than 75%. Without pre-processing, only 45% of the pages are detected with a confidence score of at least 75%.
With the correct pre-processing, 97% of the pages exceed the defined threshold for the region Lausitz (and the share rises from 44% to 93% for the region Mitteldeutschland). In 93% (Lausitz) and 83% (Mitteldeutschland) of the pages with a confidence below 75%, the original documents do not contain any recognizable text. Hence, the majority of unrecognized or insufficiently recognized documents do not contain proper amounts of text. In more detail, the pages without recognizable text originate from documents containing maps with symbols, legends, or photos. Without proper pre-processing, the symbols, parts of the legends, or the reference region names are extracted, but these are not useful for answering requests in general.
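The page-level aggregation and filtering described above can be sketched as follows. This is a minimal illustration only: the function names and the acceptance helper are our own, and the confidences stand for Tesseract's word-level scores on a 0-100 scale.

```python
def page_confidence(word_confidences):
    """Average Tesseract's word-level confidences (0-100) to a page-level
    score; pages without any recognized words are scored 0."""
    if not word_confidences:
        return 0.0
    return sum(word_confidences) / len(word_confidences)

def accept_pages(pages, threshold=75.0):
    """Keep only pages whose page-level confidence exceeds the threshold
    (pages with more than 75% are accepted for the experimental dataset)."""
    return [page_id for page_id, confs in pages.items()
            if page_confidence(confs) > threshold]
```

For example, `accept_pages({"p1": [92, 88], "p2": []})` keeps only `"p1"`, since the empty page scores 0.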

Data processing
To successfully extract information from the documents, the text has to be extracted, segmented into sentences, and further pre-processed linguistically for TC. The text extraction depends on the quality of the OCR process described in detail above. In this work, we focus on the classification of restrictions and associated topics. Restrictions can take manifold formulations, e.g., a specific usage can be forbidden, a usage may require certain actions to be allowed, or a usage may be explicitly allowed under certain circumstances while implicitly not allowed otherwise. Moreover, a restriction may be associated with certain topics, such as weather-related or construction-related restrictions. In Table 1, we give a detailed overview of the envisioned category system. The labels are deductively defined and reflect the requirements of the most frequent requests to the LMBV organization. The set of labels is not fixed yet but poses a reasonable basis for building a classification system. Since the category system is specifically defined for this novel approach, no training or pre-labeled data can be provided by the LMBV. Hence, examples and possible keywords were defined for the categories in order to describe their properties. The labels for our experiments were created as follows: keywords, as shown in the Appendix, were used to locate restriction and prohibition candidates. From this candidate pool, we select candidates for the sub-categories utilizing further keyword matching. In doing so, we take a maximum of 150 examples per sub-category. If no more than 300 candidates can be found for a sub-category, we only include half of them in the candidate list to leave examples of such rare categories for AL in our unlabeled dataset. Since we want to demonstrate the capabilities of AL, this is a necessary decision. Additionally, we added more than 700 randomly drawn sentences to arrive at a dataset of 2,000 sentences.
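The bootstrapping rule above (a cap of 150 examples per sub-category, and only half of the candidates for rare sub-categories) can be sketched as follows. The function name and the fixed seed are our own illustrative choices, not part of the actual pipeline.

```python
import random

def sample_subcategory(candidates, cap=150, rare_cutoff=300, seed=0):
    """Select labeling candidates for one sub-category: if no more than
    `rare_cutoff` candidates exist, keep only half of them so that examples
    of rare categories remain in the unlabeled pool for active learning;
    otherwise take at most `cap` examples."""
    rng = random.Random(seed)
    if len(candidates) <= rare_cutoff:
        return rng.sample(candidates, len(candidates) // 2)
    return rng.sample(candidates, cap)
```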
This dataset was annotated by three different annotators. We measured the inter-annotator agreement with Krippendorff's α Krippendorff [2011], which resulted in values between 0.70 and 0.91, with Restricted Area being the most agreed upon and Construction the least agreed upon label between annotators. This confirms our observation that labeling in this domain is challenging and needs domain expert knowledge. To finally construct our evaluation and experiment data, we use majority voting on the annotations to filter the noisy judgments of our non-expert annotators Nowak and Rüger [2010]. Requirement is the most frequent label, while Weather is the rarest (see Appendix, Table 2). The dataset's true label distribution is unknown, but at least some of the labels seem to be very sparse. It is expected that the majority of the sentences in the complete dataset will not receive an assignment of a valid category. We split our annotated dataset into training (500 samples), validation (500 samples), and test (1,000 samples) sets using iterative stratification Sechidis et al. [2011] to maintain the label distribution in all three sets (see Appendix, Table 2). As this constructed dataset is intended to study an AL approach, the training set is only used as initial data and is therefore rather small.

Table 1: Category system with descriptions and example sentences.

Restrictions
- Prohibition: Statements which actively prohibit or restrict actions, generally or conditionally. Example: "Machines heavier than 30t are forbidden, landslide hazard."
- Requirement: Requirements limit usages and/or are directives on what needs to be done. Example: "The area must be secured with 'no trespassing' signs."

Topics
- Weather: Weather-related phenomena, consequences, and protection measures. Example: "Shore areas must be avoided during heavy rain."
- Construction: Statements about construction plans, construction sites, or construction procedures. Example: "Only one-storey buildings should be placed around the marina."
- Geotechnics: Information related to the ground, e.g., soil, stability, or slopes. Example: "Slopes must be protected against the effects of the weather."
- Restricted Area: Indicates limited accessibility, mostly due to hazards, soil stability, or safety precautions. Example: "Always keep a distance of at least 50m to the shore."
- Planting: Plans, reports, or specific details about the type of plant and the location of plantings. Example: "Native species of bushes must be planted on the slope, to stabilize it against rupture."
- Environment: In renaturalisation, it is often strictly regulated where to plant, what, and which types. Example: "During breeding season, forest operations are forbidden."
- Disposal: Instructions concerning storage and disposal of (building) materials. Example: "Contaminated soil must be cleaned and provably be disposed of."

Approach
The goal of our approach is to detect labels from the classes of both restrictions and topics at the sentence level in physical legacy documents. As a result, we can map these labels to geospatial data, which were already digitized in the past. This means that, with a process chain of OCR, TC, and GIS, we can effectively detect the presence or absence of labels of interest at geographic coordinates. In the end, this can be directly used to manage renaturation efforts, thereby supporting the aforementioned SDGs. As existing OCR solutions are tried and tested, and the geospatial link is already given, the main challenge of this method is the TC, which will therefore be the focus of our experiments. The main challenges of this TC task are: (i) There is no predefined industrial standard for the labels, which are not formally defined but given by some exemplary formulations and keywords provided by the LMBV. Consequently, the definitions are incomplete, and new formulations not using the keywords are expected. (ii) The documents exhibit a very domain-specific, often convoluted, vocabulary. In the following, we describe the whole process chain, with the exception of OCR, which has already been explained in Section 3.1.

Sentence Classification and Active Learning
Our TC operates on the data described in Section 3. Due to the text layouts, we detected hyphenated words, converted line breaks into white space, and finally trimmed repeated sequences of white space. Subsequently, word and sentence segmentation was performed using syntok 6 . In order to clean the noise introduced by the OCR step (some sentences still contain a lot of noise, ranging from single misclassified characters to entirely gibberish words) and other text which does not constitute a normal sentence (such as addresses, table headlines, etc.), we filtered sentences which violate the properties of a valid sentence Goldhahn et al. [2012]. This was achieved by a set of regular expressions and filter rules which detect improper sentences, e.g., sentences which contain too many special characters, start with a lowercase letter, or are missing a terminal punctuation character. In contrast to standard TC datasets, most real-world data, like the LMBV data, usually provides no labels. Manually labeling documents, however, is time-consuming and therefore costly. This is where AL Lewis and Gale [1994] comes into play: in an iterative process, the active learner presents unlabeled data to a user, who then has to label it. The purpose is to reduce the total labeling effort by identifying the samples that add the most value to the current model. The key for this is the query strategy, which decides upon the examples to be labeled by the user. After labeling the presented samples, a new model is trained, and the loop is repeated, either for a certain number of rounds or until a stopping criterion is met. We assume the pool-based scenario Settles [2010], in which the active learner has access to all unlabeled data and queries only examples whose labels do not yet exist. This exactly matches our setup.
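A few of the filter rules mentioned above can be illustrated as follows. The exact rules and thresholds of our pipeline differ; treat this as an illustrative sketch only (the special-character threshold is an assumption).

```python
import re

def is_valid_sentence(sentence, max_special_ratio=0.1):
    """Reject OCR noise and non-sentences: too short, starting with a
    lowercase letter, missing terminal punctuation, or containing too many
    special characters. In Python 3, \\w is Unicode-aware and thus also
    covers German umlauts."""
    s = sentence.strip()
    if len(s) < 10:
        return False
    if not s[0].isupper():
        return False
    if s[-1] not in ".!?":
        return False
    specials = len(re.findall(r"[^\w\s.,;:!?()'\"-]", s))
    return specials / len(s) <= max_special_ratio
```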
Since no labels are provided, and the ratio of sentences having at least one label to irrelevant sentences is quite small, randomly sampling data is not an option, and AL is the obvious choice. The TC is realized using two independent classifiers for restrictions and topics (see Table 1). As the single labels under restrictions and topics are not mutually exclusive, we train multi-hot-encoded multi-label classifications. We trained two separate classifiers, since it is easier for the human annotator to focus on only a single set of labels during the AL process.
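The multi-hot target encoding can be illustrated as follows; the label names are taken from Table 1, while the helper itself is our own sketch.

```python
TOPIC_LABELS = ["Weather", "Construction", "Geotechnics", "Restricted Area",
                "Planting", "Environment", "Disposal"]

def multi_hot(assigned_labels, vocabulary=TOPIC_LABELS):
    """Encode a (possibly empty) set of labels as a multi-hot vector;
    labels are not mutually exclusive, so several positions may be 1."""
    return [1 if label in assigned_labels else 0 for label in vocabulary]
```

A sentence labeled both "Weather" and "Restricted Area" thus becomes `[1, 0, 0, 1, 0, 0, 0]`, and an unlabeled sentence the all-zero vector.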

Geospatial Connection
The linkage of the non-spatial data and the spatial data can be represented as a graph in which the data elements are represented by nodes. Edges then link the data elements that are related to each other.
The predicted labels can be integrated into the data model via the existing assignment of the documents to concrete coordinates with the following procedure. A node is created for each labeled sentence, carrying the label description as a node property. Restrictions that have not been classified more precisely by a topic are grouped together under a generic category. Then, edges are created for each restriction, pointing from the node of the associated category to the document node from which the restriction originated. Additional information about the restrictions is available as attributes of the respective edges. This includes the sentence from which the restriction is derived and the confidence value from the TC algorithm. These attributes are stored on the edge instead of the document node itself, since a document can lead to several restrictions, either in the same category or in different categories (e.g., "large installations may not be built" [construction-related] and "may not enter shore areas during heavy rain" [weather, restricted area]). Many queries can be realized with this data structure. For example, it is possible to efficiently query which restrictions exist in the same topic, in the same document, or in the same geographic area, since only the corresponding nodes need to be followed in the data model.
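The described data model can be sketched as a small edge list. The class and method names are our own, and a production system would use a graph database rather than in-memory lists; the point is only that edges carry the sentence and confidence as attributes.

```python
class RestrictionGraph:
    """Category nodes point to document nodes via edges; each edge carries
    the extracted sentence and the TC confidence as attributes, since one
    document can yield several restrictions."""

    def __init__(self):
        self.edges = []  # (category, document, attributes)

    def add_restriction(self, category, document, sentence, confidence):
        self.edges.append((category, document,
                           {"sentence": sentence, "confidence": confidence}))

    def by_category(self, category):
        """All restrictions sharing one topic/category."""
        return [(doc, attrs) for cat, doc, attrs in self.edges if cat == category]

    def by_document(self, document):
        """All restrictions derived from one document."""
        return [(cat, attrs) for cat, doc, attrs in self.edges if doc == document]
```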

Experiments
We evaluate AL performed by three human annotators, each of whom trains a sentence classification model for classifying (i) restrictions and (ii) topics, resulting in two runs per person.

Pre-processing and Experimental Setup
Starting from the initial model, which is trained on the train set (described in Section 3), AL is performed iteratively: (i) 10 unlabeled sentences are presented to the annotator; (ii) The annotator may assign zero, one, or multiple labels per sentence; (iii) The newly-assigned labels are added to the train set, and a new model is trained. This process is repeated for 50 iterations.
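The iterative protocol above can be sketched as follows; `train_model`, `query`, and `annotate` are placeholders for the classifier training, the query strategy, and the human annotator described in the text, not actual components of our system.

```python
def active_learning_loop(initial_train, pool, train_model, query, annotate,
                         iterations=50, batch_size=10):
    """Pool-based active learning: query a batch of unlabeled sentences,
    let the annotator assign zero or more labels to each, retrain, repeat."""
    labeled = list(initial_train)
    model = train_model(labeled)
    for _ in range(iterations):
        for sentence_id in query(model, pool, n=batch_size):
            labels = annotate(pool[sentence_id])  # zero, one, or multiple labels
            labeled.append((pool.pop(sentence_id), labels))
        model = train_model(labeled)
    return model, labeled
```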
Data We use train, validation and test splits, as defined in Section 3.2, and an unlabeled pool consisting of 312,299 sentences.
Query Strategy For the query strategy, which selects the sentences to be labeled, we use prediction-entropy-based Roy and McCallum [2001] uncertainty sampling Lewis and Gale [1994], which selects the most uncertain samples, i.e., in this case, those whose predicted class posterior exhibits the highest entropy. Since inference with transformers is computationally expensive, and we aim to keep waiting times at a minimum, at the beginning of each iteration we randomly subsample 4,096 examples from the whole unlabeled pool. Moreover, because the ratio of unlabeled sentences to sentences having at least one label is quite large, we adapt the query strategy to balance classes by considering the class predictions and sampling evenly over the labels. In case this is not possible, e.g., when there is no prediction for a certain label, we fill the remainder with the remaining most uncertain samples, regardless of the predicted class.
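Prediction-entropy-based uncertainty sampling can be sketched as follows, without the class-balancing and pool-subsampling refinements described above; the function names are our own.

```python
import math

def prediction_entropy(distribution):
    """Shannon entropy of a predicted class posterior; the more uniform
    the distribution, the more uncertain the model is about the sample."""
    return -sum(p * math.log(p) for p in distribution if p > 0)

def most_uncertain(pool_predictions, n=10):
    """Return the ids of the n samples with the highest prediction entropy.
    `pool_predictions` maps sample id -> predicted class distribution."""
    ranked = sorted(pool_predictions,
                    key=lambda sid: prediction_entropy(pool_predictions[sid]),
                    reverse=True)
    return ranked[:n]
```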

Model and Training
Regarding the classification, we fine-tune the pre-trained gbert-base Chan et al. [2020], which has 110M parameters and is the best-performing German transformer model for the task of TC relative to its number of parameters. While there is a larger gbert model available, we opted for the base variant due to its efficiency, which results in lower turnaround times of an AL step for the practitioner. We encode the multi-label targets as multi-hot vectors. The model is trained using a binary cross-entropy loss.
During each AL iteration, the previous model is fine-tuned for 40 epochs using a learning rate of 5e-5 on the data that has been labeled up to this point. To avoid overfitting, we apply early stopping, which is triggered as soon as the training accuracy crosses 98% or the validation loss does not change for more than 5 epochs.

Table 2: Results of the active learning performed by three human annotators. "AVG." is the annotator average over all three runs. "F1 AL" shows the final scores, broken down by annotator. "F1 B." is a text classification baseline that is trained on the initial training set.

Table 2 shows the classification scores of the aforementioned setting evaluated by three annotators and compared to an automated TC baseline. The baseline was trained on the train set and reflects the AL's initial classifier, i.e., without using AL at all. Overall, AL improves both micro-F1 and macro-F1 for topics by about 2 percentage points, whereas we did not see any improvement for restrictions. While the overall result improves only slightly, looking at the single labels, we can see considerable changes between plain TC and AL. On the positive side, previously weaker labels like "Weather" and "Construction" improve by 5 to 21 percentage points in F1. Smaller improvements can also be seen for "Restricted Area" and "Environment", and "Disposal" stays about the same. Unfortunately, "Planting" and "Geotechnics" drop in performance by 5 to 8 percentage points. Interestingly, when we compare the difference in the relative quantities of co-occurring labels before and after the AL process, we find that the labeled pool changed notably during AL. We observed that (i) the average number of labels per sentence increases; (ii) label co-occurrences shift considerably and some combinations even appear for the first time; (iii) every combination of topic labels occurs together in the data, which is not the case for our keyword-bootstrapped train set (see Appendix, Figures 1-3).
All in all, this indicates that AL is beneficial and improves classification metrics by a small amount; moreover, many samples with previously rare or even unseen label combinations are found. Apparently, as these notable changes only lead to a small difference in F1, the added value of more diverse label combinations is difficult to measure against our keyword-bootstrapped test set. Building a more representative test set, however, would require massive annotation efforts, since labels may be very sparse.

Visualization and Interaction Use Case
As an example, we present a workflow regarding areas that may not be entered during heavy rain for safety reasons. To answer a request (see Section 3) of whether such an area may currently be entered, the expert uses the GIS, centers the map on the corresponding area, and displays the associated features (e.g., active dismantling areas, see Figure 2 A). To enable the expert to analyze the different categories of data, the displayed elements are colored by category, as suggested by Ware [2012]. Since areas can overlap in the map display, all elements are colored only semi-transparently. Information immediately prohibiting certain activities is identified by clicking on an element, which displays the non-spatial data in an information panel (Figure 2 A-B). To help the expert keep an overview of the selected elements, all selected elements are represented with a striped texture. All restrictions that result from the documents linked to the selected element are listed (Figure 2 C). The entries are grouped by restriction type and sorted by confidence value (Figure 2 C1 and C2). The document title and the sentence from which the restriction is derived are indicated. A click on the title opens a new window for reading the document. This list provides the expert with direct feedback on which documents might be relevant and, without reading them completely, an overview of which usage restrictions are present. This information is crucial for the experts, as it can have a significant impact on planned projects and their planning time. Additionally, the area described by a document can be superimposed with weather data. In this way, decisions regarding conditional restrictions (e.g., "may not enter shore areas during heavy rain") can also be made more quickly and directly on the basis of the system.
The selected element overlaps with a heavy rain area represented by isobands; therefore, the request is directly answered, and access to the area is currently prohibited (Figure 2 A). However, for other restrictions, it is possible that more information is necessary, because experts often compare the region of interest with similar regions. Therefore, we provide a filter to highlight all elements within the same category (Figure 2 C). By analyzing similar regions, the expert can derive recommendations for action, which might be necessary for the renaturation of an area. They can also derive recommendations for possible usage restrictions. Furthermore, this comparison can prevent actions from not being taken, or from being taken too late, because the current information does not make them appear necessary even though it is clear from similar projects that they may nevertheless become necessary. This leads to a safe and quick renaturation and after-use of regions maintained in that manner, since precautions can be taken in advance.

Summary and Future Work
In this work, we have designed and evaluated a system that advances the level of automation for managing unstructured data in information systems for former open pit coal mines. We have shown that technologies from NLP, visualization, and GIS can interact via appropriate data models. As a result, maintainers of former mines in the eastern part of Germany are now able to identify possible usage restrictions within a very short time. Furthermore, the risk of missing important information is lowered. The main goal of the renaturation of former open pit mines can be efficiently supported so that rapid reuse is promoted in these regions. Since this problem exists in many countries of the world, we are convinced that such work is an essential contribution to the implementation of important SDGs in these countries. However, the individual and heterogeneous situations regarding the different document layouts point to a research direction where more OCR-related research needs to be applied. This includes layout detection, table recognition, fast model adaptation to legacy documentation situations, and robust classifiers for large-scale applications and a heterogeneous field of possible topics. AL can help to build a strong TC by finding important samples not seen before when facing the challenges of unlabeled data and sparse labels. Further research is needed to shift recall towards 100% so that every example is found, and to then correct false positives in the system.

Ethical Considerations
The results of our work provide a workflow for automatic information extraction in mining-, planning-, and nature-conservation-related reports. The collected information represents access restrictions and related issues.
We are aware that misclassification in the application can lead to people being endangered or prevented from entering these regions for no reason. Misuse cannot be ruled out, but no specific example is known.
To ensure that these effects do not impact the stakeholders of the application, a quality assurance process will be used in the operating company, so that employees in the piloting phase manually check where errors or information losses can be detected. In addition, there will be quality assurance for the application itself, so that hazards are eliminated as far as possible. Furthermore, our results could in theory lead to a decline in the number of employees needed to read and check old documents, possibly resulting in job losses. Our scenario, however, requires specialists, who are not easily replaceable.