Handling Extreme Class Imbalance in Technical Logbook Datasets

Technical logbooks are a challenging and under-explored text type in automated event identification. These texts are typically short and written in non-standard yet technical language, posing challenges to off-the-shelf NLP pipelines. The granularity of issue types described in these datasets additionally leads to class imbalance, making it challenging for models to accurately predict which issue each logbook entry describes. In this paper we focus on the problem of technical issue classification by considering logbook datasets from the automotive, aviation, and facilities maintenance domains. We adapt a feedback strategy from computer vision for handling extreme class imbalance, which resamples the training data based on its error in the prediction process. Our experiments show that, with statistical significance, this feedback strategy provides the best results for four different neural network models trained across a suite of seven technical logbook datasets from distinct technical domains. The feedback strategy is also generic and could be applied to any learning problem with substantial class imbalances.


Introduction
Predictive maintenance techniques are applied to engineering systems to estimate when maintenance should be performed to reduce costs and improve operational efficiency (Carvalho et al., 2019), as well as mitigate risk and increase safety. Maintenance records are an important source of information for predictive maintenance (McArthur et al., 2018). These records are often stored in the form of technical logbooks in which each entry contains fields that identify and describe a maintenance issue (Akhbardeh et al., 2020a). Being able to classify these technical events is an important step in the development of predictive maintenance systems.
In most technical logbooks, issues are manually labeled by domain experts (e.g., mechanics) in free text fields. This text can then be used to classify or cluster events by semantic similarity. Classifying events in technical logbooks is a challenging problem for the NLP community for several reasons: (a) the technical logbooks are written by various domain experts and contain short text entries with nonstandard language including domain-specific abbreviated words (see Table 1 for examples), which makes them distinct from other short non-standard text corpora (e.g., social media); (b) off-the-shelf NLP tools struggle to perform well on this type of data as they tend to be trained on standard contemporary corpora such as newspaper texts; (c) outside of the clinical and biomedical sciences, there is a lack of domain-specific, expert-based datasets for studying expert-based event classification, and in particular few resources are available for technical problem domains; and (d) technical logbooks tend to be characterized by a large number of event classes that are highly imbalanced.
Original Entry → Pre-processed Entry
fwd eng baff seeal needs resecured. → forward engine baffle seal needs resecured.
r/h eng #3 intake gsk leaking. → right engine number 3 intake gasket leaking.
bird struck on p/w at twy. bird rmvd. → bird struck on pilot window at taxiway. bird removed.
location rptd as nm from rwy aprch end. → location reported as new mexico from runway approach end.

Table 1: Original and text-normalized example data instances, illustrating that domain-specific terms (baffle), abbreviations (gsk - gasket, eng - engine), and misspellings (seeal - seal) are abundant in logbook data.
We address the aforementioned challenges with a special focus on exploring strategies to address class imbalance. There is wide variation in the number of instances among the technical event classes examined in this work, as shown in Figure 1 and Table 3. This extreme class imbalance is an obstacle when processing logbooks, as it causes most learning algorithms to become biased and mainly predict the large classes (Kim et al., 2019). To overcome this issue, we introduce a feedback loop strategy, a repurposing of a method used to address extreme class imbalance in computer vision (Bowley et al., 2019), and examine it for the classification of textual technical event descriptions. This technique is applied in the training of a suite of common classification models on seven predictive maintenance datasets representing the aviation, automotive, and facility maintenance domains.

This paper addresses the following research questions:

RQ1: To what extent do the class granularity and class imbalance present in technical logbooks impact technical event classification performance, and can a feedback loop for training data selection effectively address this issue?

RQ2: Which classification models are better suited to classify technical events for predictive maintenance across logbook datasets representing different technical domains?

The main contributions of this work include:

1. Experimental results showing strong performance of the feedback loop in addressing the class imbalance problem in technical event classification across all datasets and models;

2. A thorough empirical evaluation of the performance of the technical event classifier considering multiple models and seven logbook datasets from three different domains.

Related Work
Most expert-domain datasets containing events have focused on healthcare. For instance, Altuncu et al. (2019) analyzed patient incidents in unstructured electronic health records provided by the U.K. National Health Service. They evaluated a deep artificial neural network model on an expert-annotated textual dataset of safety incidents to identify similar events that occurred. Deléger et al. (2010) proposed a method to deal with unstructured clinical records, using rule-based techniques to extract names of medicines and related information such as prescribed dosage. Savova et al. (2010) considered free-text electronic medical records for information extraction purposes and developed a system to obtain clinical domain knowledge. Patrick and Li (2009) proposed cascade methods for extracting medication records, such as treatment duration or reason, from patients' historical records. Their approach to event extraction includes text normalization, tokenization, and context identification; a system using multiple features outperformed a baseline bag-of-words model. Yetisgen-Yildiz et al. (2013) proposed a lung disease phenotype identification method to avoid a hand-operated identification strategy. They employed NLP pipelines, including text pre-processing and text classification on textual reports, to identify patients with a positive diagnosis for the disease, achieving notable performance using n-gram features with a Maximum Entropy (MaxEnt) classifier.

Table 2 (caption fragment): ... (3), automotive safety (4), and facility maintenance (5). Each instance shows how domain-specific terminology, abbreviations (Abbr.), and misspelled words (in bold font) are used by the domain expert, and also illustrates some of the event types covered. More details are provided in Section 3.
There is also relevant research on event classification in social media. For example, Ritter et al. (2012) proposed an open-source event extraction system and supervised tagger for noisy microblogs. Cherry and Guo (2015) applied word embedding-based modeling for information extraction on newswire and tweets, comparing named entity taggers to improve their method. Hammar et al. (2018) performed experimental work on Instagram text, using weakly supervised text classification to extract clothing brands based on user descriptions in posts.
The problem of class imbalance has been studied in recent years for numerous natural language processing tasks. Tayyar Madabushi et al. (2019) studied automatic propaganda event detection in a news dataset using a pre-trained BERT model. Recognizing that the BERT model had issues generalizing, they proposed a cost-weighting method to overcome this. Al-Azani and El-Alfy (2017) analyzed polarity measurement in imbalanced tweet datasets utilizing features learned with word embeddings. Li and Nenkova (2014) studied the class imbalance problem in the task of discourse relation identification by comparing the accuracy of multiple classifiers. They showed that utilizing a unified method and further downsampling the negative instances can significantly enhance the performance of the prediction model on imbalanced binary and multi-class data.
Dealing with imbalanced classes has also been well studied in sentiment classification. Li et al. (2012) introduced an active learning method that overcomes class imbalance by choosing significant samples of the minority class for manual annotation and of the majority class for automatic annotation, lowering the amount of human annotation required. Furthermore, Damaschk et al. (2019) examined techniques for dealing with high class imbalance in classifying a collection of song lyrics. They employed neural network models, including a multi-layer perceptron and a Doc2Vec model, in experiments whose finding was that undersampling the majority class can be a reasonable approach to remove data sparsity and improve classification performance. Other work has explored the problem of high data imbalance using cross-entropy criteria as well as standard performance metrics, proposing a loss function called Dice loss that assigns equal importance to false negatives and false positives. In computer vision, Bowley et al. (2019) developed an automated feedback loop method to identify and classify wildlife species in Unmanned Aerial Systems imagery, training CNNs to overcome the imbalanced class issue; on their expert imagery dataset, the error rate decreased substantially from 0.88 to 0.05. This work adapts that feedback loop strategy to the NLP problem of classifying technical events.

Technical Event Datasets
In this work, we used a set of seven logbook datasets from the aviation, automotive, and facility domains available at MaintNet (Akhbardeh et al., 2020a). MaintNet is a collaborative open-source platform for predictive maintenance language resources featuring multiple technical logbook datasets and tools. These datasets include:

1) Avi-Main contains seven years of maintenance logbook reports on aircraft maintenance collected by the University of North Dakota aviation program, reported by the mechanic or pilot.

2) Avi-Acc contains four years of aviation accident and damage reports.

3) Avi-Safe contains eleven years of aviation safety and incident reports. Accidents were caused by foreign objects/birds during flight, which led to safety inspection and maintenance, with safety crews indicating the damage (safety) level for further analysis.

4) Auto-Main is a single year of maintenance records for cars.

5) Auto-Acc contains twelve years of car accident and crash reports describing the related car maintenance issue and the property damaged in the accident.

6) Auto-Safe contains four years of hazards and incidents on the roadway noted by drivers.

7) Faci-Main contains six years of logbook reports collected for building maintenance.

These technical logbooks consist of short, compact, and descriptive domain-specific English texts; single instances usually contain between 2 and 20 tokens, including abbreviations and domain-specific words. An example instance from Table 2, r/h fwd upper baff seal needs to be resecured, shows how the instances for a specific issue class are composed of specific vocabulary (less ambiguity), and therefore exhibit a high level of granularity (the level of description for an event across multiple words) (Mulkar-Mehta et al., 2011).
Table 3 presents statistics for each dataset, in terms of the number of instances, average instance length, number of classes, and the minimum, average, median and maximum class size to represent how imbalanced the datasets are.
An instance in the logbook may be a complete description of the technical event (such as a safety or maintenance inspection), e.g., #2 & #4 cyl rocker cover gsk are leaking, or it may contain an incomplete description that solely refers to the damaged part/section of the machinery (hyd cap chck eng light on) using few domain words. In either form of problem description, the given annotation (label) is at the issue-type level, e.g., baffle damage. Table 2 shows multiple examples with associated instances.
Further characteristics of these log entries include compound words (antifreeze, engine-holder, drift-angle, dashboard). Many of these words (e.g., the compound word dashboard) essentially represent the items or domain-specific parts used in the descriptions. Additionally, function words (e.g., prepositions) are important, and removing them could alter the meaning of the entry. The logbook datasets have both shared and distinct characteristics:

Shared Characteristics: Each instance contains a descriptive observation of the issue and/or the suggested action that should be taken (eng inspection panel missing screw). Each instance also refers to a single maintenance event, meaning the recognized problem applies to only a single issue type. As an example, the instance cyl #1 baff cracked at screw support & forward baff below #1 includes a combination of sequences that refer to the location and/or a specific part of the machinery.
Distinct Characteristics: In each domain, terminology and abbreviations are distinct, and an abbreviation can have different expansions depending on the domain context (Sproat et al., 2001); e.g., a/c can mean aircraft in aviation and air conditioner in the automotive domain. However, the abbreviations and acronyms of domain words (e.g., atc - air traffic control) in these technical datasets should not be approached as a word sense disambiguation problem, as they require character-level expansion.

Handling Class Imbalance
Collecting additional data to augment datasets is a common approach for tackling the problem of skewed class distributions. However, as discussed earlier, technical logbooks are proprietary and very hard to obtain. In addition, each domain captures domain-specific lexical semantics, preventing the use of techniques such as domain adaptation.

Re-sampling

Under- and over-sampling are resampling techniques (Maragoudakis et al., 2006) that were used to create balanced class sizes for model training. For over-sampling, instances of the minority classes are randomly copied so that all classes have the same number of instances as the largest class. For under-sampling, observations are randomly removed from the majority classes, so that all classes have the same number of instances as the smallest class. For both approaches, we first divided our datasets into test and training sets before performing resampling, to prevent contamination of the test set by having the same observations in both the training and test data.
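The two re-sampling baselines can be sketched in a few lines of Python. This is an illustrative implementation (the function names are ours, not from the paper's codebase); note that, as described above, the split into training and test sets must happen before re-sampling.

```python
import random
from collections import defaultdict

def oversample(instances, labels, seed=0):
    """Randomly duplicate minority-class instances until every class
    matches the size of the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(instances, labels):
        by_class[y].append(x)
    target = max(len(xs) for xs in by_class.values())
    out = []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out.extend((x, y) for x in xs + extra)
    rng.shuffle(out)
    return out

def undersample(instances, labels, seed=0):
    """Randomly discard majority-class instances until every class
    matches the size of the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(instances, labels):
        by_class[y].append(x)
    target = min(len(xs) for xs in by_class.values())
    out = []
    for y, xs in by_class.items():
        out.extend((x, y) for x in rng.sample(xs, target))
    rng.shuffle(out)
    return out
```

Over-sampling drives every class up to the majority size (risking overfitting on duplicated minority instances), while under-sampling drives every class down to the minority size (discarding majority-class data).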

Feedback Loop
To address class imbalances in text classification, this work adapts the approach of Bowley et al. (2019) from the computer vision domain. The goal of this approach is not only to alleviate the bias towards majority classes but also to adjust the training data instances such that the models are always being trained on the instances they were performing the worst on. It should be noted that this approach is very similar to adaptive learning strategies, which have been shown to aid human learning (Kerr, 2015; Midgley, 2014).
Algorithm 1 presents pseudocode for the feedback loop. In this process, the active training data (the data used to actually train the models in each iteration of the loop) is continually resampled from the training data. The model is first initially trained with an undersampled number of random instances from each class, which becomes the initial active training data. The model M then performs inference over the entire training set, and then selects MCS instances from each class C i which had the worst error during inference, where MCS is the minority (smallest) class size. The model is then retrained with this new active training data and the process of training, inference and selection of the MCS worst instances repeats for a fixed number of feedback loop iterations, FLI. In this way the model is always being trained on the instances it has classified the worst.
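The loop described above can be rendered in Python as follows. This is an illustrative sketch, not the authors' released code: the `model` object with `train` and `error` methods is a hypothetical interface standing in for any of the classifiers, and `error` would in practice be a per-instance loss obtained by inference over the full training set.

```python
import random

def feedback_loop(model, train_X, train_y, classes, fli, seed=0):
    """Sketch of the feedback loop: each iteration, run inference over
    the full training set and rebuild the active training data from the
    MCS worst-scoring instances of every class, where MCS is the
    minority (smallest) class size and FLI the number of iterations."""
    rng = random.Random(seed)
    mcs = min(sum(1 for y in train_y if y == c) for c in classes)
    # Initial active data: MCS random instances per class.
    active = []
    for c in classes:
        pool = [(x, y) for x, y in zip(train_X, train_y) if y == c]
        active.extend(rng.sample(pool, mcs))
    for _ in range(fli):
        model.train([x for x, _ in active], [y for _, y in active])
        # Inference over the entire training set; keep the worst per class.
        active = []
        for c in classes:
            pool = [(x, y) for x, y in zip(train_X, train_y) if y == c]
            pool.sort(key=lambda p: model.error(*p), reverse=True)
            active.extend(pool[:mcs])
    return active
```

Because every class contributes exactly MCS instances per iteration, the active set stays balanced while its contents track the model's current weaknesses.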
To measure the effect of resampling the worst performing instances, the feedback loop approach was also compared to a random downsampling (DS) loop, where instead of evaluating the model over each instance and selecting the worst performing instances, MCS instances from each class are instead randomly sampled. As performing inference over the entire training set adds overhead, a comparison to the random DS loop method would show if performing this inference is worth the performance cost over simple random resampling. This approach is the same as Algorithm 1 except that SampleRandom is used instead of Resample in the feedback loop. Section 4.3 describes how the number of training epochs and loop iterations were determined such that all the training data selection methods are given a fair evaluation with the same amount of computational time.
Evaluation Metrics

For imbalanced datasets, simply using precision, recall, or F1 score over the entire dataset would not accurately reflect how well a model or method performs, as these metrics emphasize the majority classes. To overcome this, alternative evaluation metrics suited to the class imbalance problem were used, as recommended by Banerjee et al. (2019). Specifically, we report the models' performance based on precision, recall, and F1 score using a macro-average over all classes, as this gives every class equal weight, and hence reveals how well the models and training data selection strategies perform across all classes.
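As a concrete illustration of why macro-averaging is used, the sketch below computes macro precision, recall, and F1 by hand (the helper name is ours; in practice a library routine such as scikit-learn's would be used). A classifier that always predicts the majority class can score well on plain accuracy yet poorly on the macro scores, since every class carries equal weight.

```python
def macro_prf(y_true, y_pred):
    """Macro-averaged precision/recall/F1: per-class scores averaged
    with equal weight, so small classes count as much as large ones."""
    classes = sorted(set(y_true) | set(y_pred))
    precs, recs, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec); recs.append(rec); f1s.append(f1)
    n = len(classes)
    return sum(precs) / n, sum(recs) / n, sum(f1s) / n
```

For example, with true labels ["a", "a", "a", "b"] and a model that always predicts "a", plain accuracy is 0.75 but the macro F1 is only about 0.43, exposing the ignored minority class.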

Model Architecture and Training
Different machine learning methods were considered for technical event/issue classification (e.g., engine failure, turbine failure). Each instance is an individual short logbook entry and contains approximately 2 to 20 tokens (12 words on average per instance, including function words), as shown in Table 3.

Deep Neural Network

A deep artificial neural network (DNN), as described by Dernoncourt et al. (2017), can learn abstract representations and features of the input instances that help achieve better performance in predicting the issue type in the logbook dataset. The DNN used was a 3-layer, fully connected feed-forward neural network with an input embedding layer of dimension 300 (with input length equal to the number of words), followed by 2 dense layers with 512 hidden units and ReLU activation functions, followed by a dropout layer. Finally, we added a fully connected dense layer with size equal to the number of classes, with a SoftMax activation function.
Long Short-Term Memory

An LSTM RNN was also used to perform sequence-to-label classification. As described by Suzgun et al. (2019), LSTM RNNs utilize several vector gates at each state to regulate the passing of data through the sequence, which enhances the modeling of long-term dependencies. We used a 3-layer LSTM model with a word embedding layer of dimension 300 (with input length equal to the number of words), followed by an LSTM layer with the number of hidden units equal to the embedding dimension, followed by a dropout layer. Finally, we added a fully connected layer with size equal to the number of classes, with a SoftMax activation function.
Convolutional Neural Network

Convolutional neural networks (CNNs) have demonstrated exceptional success in NLP tasks such as document classification, language modeling, and machine translation (Lin et al., 2018). As Xu et al. (2020) describe, CNN models can produce consistent performance when applied to various text types, such as short sequences. We evaluated a CNN architecture (Shen et al., 2018) with a convolutional layer, followed by batch normalization, ReLU, and a dropout layer, followed by a max-pooling layer. The model contained 300 convolutional filters of size 1 by the n-gram length, with pooling of size 1 by the length of the input sequence, followed by a concatenation layer, then finally a fully connected dense layer and an output layer with size equal to the number of dataset classes, using a SoftMax activation function.
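The filter-and-pool step in this architecture can be illustrated independently of any deep learning framework: each n-gram filter yields one activation per window position, and pooling over the input length keeps only the strongest activation, so sequences of any length map to a fixed-size vector (one scalar per filter). A toy sketch with a dot-product filter (the function and its inputs are illustrative, not the paper's implementation):

```python
def conv_max_over_time(embeddings, filt):
    """Slide an n-gram filter over a sequence of word vectors and
    max-pool the activations, so any input length maps to a single
    scalar per filter. `filt` is a list of n weight vectors."""
    n = len(filt)
    dim = len(filt[0])
    activations = []
    for i in range(len(embeddings) - n + 1):
        window = embeddings[i:i + n]
        # Dot product of the window against the filter weights.
        act = sum(window[j][d] * filt[j][d]
                  for j in range(n) for d in range(dim))
        activations.append(act)
    return max(activations)
```

Applying 300 such filters and concatenating their pooled outputs yields the fixed-length representation fed to the dense output layer.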

Bidirectional Encoder Representations
We also evaluated the pre-trained uncased Bidirectional Encoder Representations (BERT) model for English (Devlin et al., 2019). We fine-tuned the model, using a WordPiece-based BERT tokenizer for tokenization and the RandomSampler and SequentialSampler for training and testing, respectively. To better optimize this model, a schedule was created for the learning rate that decayed linearly from the initial learning rate set in the optimizer down to 0.
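The linear decay described above can be written as a simple schedule. This sketch is illustrative (the paper does not specify the initial rate, and frameworks such as Hugging Face Transformers provide equivalent built-in schedulers):

```python
def linear_decay_lr(initial_lr, step, total_steps):
    """Learning rate decaying linearly from initial_lr at step 0
    down to 0 at total_steps."""
    if step >= total_steps:
        return 0.0
    return initial_lr * (1.0 - step / total_steps)
```

For example, with an (assumed) initial rate of 2e-5 over 100 steps, the rate is halved at step 50 and reaches 0 at step 100.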

Experimental Settings
Datasets and Baselines

First, the technical text pre-processing pipeline developed by Akhbardeh et al. (2020b) was applied, which comprises domain-specific noise entity removal, dictionary-based standardization, lexical normalization, part-of-speech tagging, and domain-specific lemmatization. We divided the datasets by selecting randomly from each class independently to maintain a similar class-size distribution, using 80% of the instances for training and 20% for testing. For feature extraction, two methods were considered: a bag-of-words model (n-grams: 1) (Pedregosa et al., 2011) and pre-trained 300-dimensional GloVe word embeddings (Pennington et al., 2014).
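The per-class (stratified) 80/20 split can be sketched as follows; the helper name is ours, and in practice a library implementation (e.g., scikit-learn's stratified splitters) would normally be used:

```python
import random
from collections import defaultdict

def stratified_split(instances, labels, test_frac=0.2, seed=0):
    """Split each class independently so train and test keep a similar
    class-size distribution (80/20 here)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(instances, labels):
        by_class[y].append((x, y))
    train, test = [], []
    for pairs in by_class.values():
        rng.shuffle(pairs)
        k = max(1, round(len(pairs) * test_frac))
        test.extend(pairs[:k])
        train.extend(pairs[k:])
    return train, test
```

Splitting each class independently matters here: with extreme imbalance, a plain random split could leave the smallest classes absent from the test set entirely.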
Hyperparameters and Tuning

The coarse-to-fine learning (CFL) approach (Lee et al., 2018) was used to set parameters and hyperparameters for the DNN, LSTM, and CNN models. Experiments considered batch sizes of 32, 64, and 128, an initial learning rate ranging from 0.01 to 0.001 with a learning decay rate of 0.9, and dropout regularization in the range from 0.2 to 0.5 in all models, as well as ReLU and SoftMax activation functions (Nair and Hinton, 2010), categorical cross-entropy (Zhang and Sabuncu, 2018) as the loss function, and the Adam optimizer (Kingma and Ba, 2015) for the DNN, LSTM, CNN, and BERT models. Based on experiments and network training accuracy, a batch size of 64 and dropout regularization of 0.3 were selected for model training.
Each model, with each training data selection strategy, was trained 20 times to generate results for each dataset. To ensure each training data selection strategy was fairly compared with a similar computational budget, the number of training epochs and loop iterations (if the strategy had a feedback or random downsampling loop) were adjusted so that the total number of training instance evaluations each model performed was the same. For each dataset, the number of forward and backward passes, T, for 100 epochs of the baseline strategy was used as the standard. As an example, Table 4 shows how many loop iterations, epochs per loop, and inference passes were done for each training data selection strategy on the Auto-Safe dataset. Given the differences between the minimum and maximum class sizes, it was not possible to get exact matches, but the strategies came as close as possible. We counted each inference pass for the feedback loop the same as a forward and backward training pass, which actually was a slight computational disadvantage for the feedback loop, as a forward and backward pass in training takes approximately 1x to 2x the time of an inference pass.

Table 5 shows a comparison between the baseline and the four different class balancing methods (over-sampling, under-sampling, the random downsampling (DS) loop, and the feedback loop). Based on these outcomes, the feedback loop strategy almost entirely outperforms the other methods over all datasets and models, showing that performing inference over the training set and reselecting the training data from the worst-performing instances does provide a benefit to the learning process. A plausible explanation is that this strategy does not introduce bias toward the larger classes and also does not affect the minority class size distribution. It also does not waste training time on instances the model has already learned well.
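The budget matching amounts to simple bookkeeping in instance evaluations. The sketch below estimates how many loop iterations fit within the baseline's budget, counting one feedback-loop inference pass over the training set as one training pass, as in the paper; the function and its exact accounting are illustrative, not the paper's code:

```python
def loop_iterations_for_budget(baseline_epochs, train_size, class_sizes,
                               epochs_per_loop, with_inference=True):
    """Estimate how many feedback/downsampling loop iterations fit in
    the same budget as `baseline_epochs` epochs over the full training
    set. Budget is counted in instance evaluations; each iteration costs
    epochs_per_loop passes over the active set (MCS instances per class)
    plus, for the feedback loop, one inference pass over the full set."""
    budget = baseline_epochs * train_size
    active_size = min(class_sizes) * len(class_sizes)
    per_iter = epochs_per_loop * active_size
    if with_inference:
        per_iter += train_size  # inference counted like a training pass
    return budget // per_iter
```

For instance, with a 1000-instance training set, class sizes of 800/150/50, and 5 epochs per loop, about 57 feedback-loop iterations match the baseline's 100-epoch budget, versus about 133 for the random downsampling loop, which skips the inference pass.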
Table 5 also shows the empirical analysis of the four classification models, with the model and training data selection strategy providing the overall best results shown in bold and italics. Using the technical text pre-processing techniques described in Section 4.3 and the feedback loop strategy described in Section 4.1, precision, recall, and F1 score improved compared to the baseline performance. The CNN model outperformed the other algorithms, with improved precision, recall, and F1 score for almost all datasets except Avi-Main, where BERT had similar results, and Auto-Main, where CNN and BERT tied. This is interesting given the current popularity of the BERT model; however, it may be due to the substantial lexical, topical, and structural linguistic differences between the technical logbook data and the English corpus on which BERT was pre-trained.

Results
Furthermore, we conducted the Mann-Whitney U-test of statistical significance using the F1 scores of each of the 20 repeated experiments of the classification models, with the baseline and the feedback loop approach as the two populations. The outcomes are shown in Table 6, with the differences being highly statistically significant.

Regarding the discussion provided in Section 3 about the nature of such datasets, there are key challenges that affect the performance of the employed algorithms. As discussed in Section 1, the extreme class imbalance observed in these technical datasets substantially affects learning algorithms' performance. To overcome this issue, we first explored oversampling and undersampling, which both result in balanced class sizes. Undersampling removed portions of the dataset that could be important for certain technical events or issues, which resulted in underfitting and weak generalization for important classes. On the other hand, oversampling may introduce overfitting on the minority classes, as some of the event types are very short token sequences containing domain-specific words. Following this, to minimize the possibility of overfitting and underfitting, a random downsampling loop and a feedback loop were investigated to minimize bias in the training process. It was found that the added computational cost of the feedback loop's inference passes was worthwhile compared to the random downsampling loop.
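The Mann-Whitney U statistic underlying this test can be computed directly; in practice a library routine such as scipy.stats.mannwhitneyu, which also provides the p-value, would be used, so this minimal version computes only the statistic itself:

```python
def mann_whitney_u(a, b):
    """U statistics for two independent samples: u_a counts the pairs
    (x, y) with x from `a` ranking above y from `b` (ties count 0.5).
    A small min(u_a, u_b) indicates the two distributions differ."""
    u_a = 0.0
    for x in a:
        for y in b:
            if x > y:
                u_a += 1.0
            elif x == y:
                u_a += 0.5
    u_b = len(a) * len(b) - u_a
    return u_a, u_b
```

Here the two samples would be the 20 baseline F1 scores and the 20 feedback-loop F1 scores for a given model and dataset.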
The scarce data available in a dataset such as Auto-Main is certainly an issue for deep learning methods; examining further accuracy improvements from the proposed feedback loop strategy would require incorporating more instances into the event classes. As with any supervised learning models, we noticed some limitations that could be addressed in future work. As shown in the previous sections (such as Table 2), logbook instances contain short text (ranging from 2 to 20 tokens per instance), and utilizing recurrent deep learning algorithms such as LSTM RNNs, which rely heavily on context, leads to weak performance compared to the other algorithms. One possible explanation is that logbooks with short instances (sequences) do not provide sufficient context for the algorithm to make better predictions. Another is that RNNs are notoriously difficult to train (Pascanu et al., 2013), and the LSTM models may simply require more training time to achieve similar results. There is some evidence for this: the dataset with the most instances, which also had the second-largest average number of tokens per instance, was Faci-Main; this is the dataset on which the LSTM model came closest to the CNN and BERT models, and the only one on which the LSTM outperformed the DNN.
The pre-trained BERT model provided reasonable classification performance compared to the other deep learning models; however, as BERT is pre-trained on standard language, its performance when applied to logbook data was not optimal. Training or fine-tuning BERT on technical logbook data is likely to improve performance, as observed in the legal and scientific domains (Chalkidis et al., 2020; Beltagy et al., 2019). As training or fine-tuning BERT requires large amounts of data, a limitation for fine-tuning a domain-specific BERT is the amount of logbook data available.

Conclusion and Future Work
This work focused on predictive maintenance and technical event/issue classification, with a special focus on addressing class imbalance. We acquired seven logbook datasets from three technical domains containing short instances with non-standard grammar and spelling and many abbreviations. To address RQ1, we evaluated multiple strategies for the extreme class imbalance in these datasets and showed that the feedback loop strategy performs best, almost always providing the best results for the 7 different datasets and 4 different models investigated. To address RQ2, we empirically compared different classification algorithms (DNN, LSTM, CNN, and pre-trained BERT). Results show that the CNN model outperforms the other classifiers. The methodology presented in this paper could be applied to other maintenance corpora from a variety of technical domains. The feedback loop approach for selecting training data is generic and could easily be applied to any learning problem with substantial class imbalances. This is useful as extreme class imbalance is a challenge at the heart of a number of natural language tasks.
In future work, we would like to fine-tune BERT using logbook data, as described in Section 6, and extend this work to datasets in other languages. The biggest challenge for these two research directions is the limited availability of logbook datasets. Furthermore, we are exploring various methods of domain adaptation and transfer learning on these datasets to further improve the performance of classification models.