Comprehensive Punctuation Restoration for English and Polish

Punctuation restoration is a fundamental requirement for the readability of text derived from Automatic Speech Recognition (ASR) systems. Most contemporary solutions are limited to predicting only a few of the most frequently occurring marks, such as periods, commas, and question marks, and only one mark per word. However, in written language we deal with a much larger number of punctuation characters (such as parentheses, hyphens, etc.) and their combinations (like a parenthesis followed by a period). Such comprehensive punctuation cannot always be unambiguously reduced to a basic set of the most frequently occurring marks. In this work, we evaluate several methods on the comprehensive punctuation restoration task. We conduct experiments on parallel corpora in two different languages, English and Polish, which have relatively simple and complex morphology, respectively. We also investigate the influence of building a model for comprehensive punctuation on the quality of the basic punctuation restoration task.


Introduction
The task of restoring punctuation can be crucial for the readability of text derived from ASR systems. As Tündik et al. (2018) have shown, a lack of punctuation in a transcription can have a greater negative impact on readability than a large number of word transcription errors. In recent years, punctuation prediction has most often been approached as a token classification task (Tilk and Alumäe (2016), Kim (2019), Alam et al. (2020)). In this context, the target labels are often reduced to only a few of the most frequently occurring marks, such as periods, commas, and question marks. However, in written language, we deal with a much larger number of characters (such as parentheses, hyphens, etc.). The usual approach is to reduce those punctuation marks to the basic set via role similarity (e.g., semicolons and exclamation marks are often reduced to periods) or to discard them entirely (Tilk and Alumäe (2016), Żelasko et al. (2018)). However, such a process always comes with a loss of information. Furthermore, a word can end with more than one punctuation mark; for example, the end of a parenthesis can coincide with the end of a sentence, resulting in the combination of these marks into '). '. Predicting only a period in such a place would strongly violate the structure of the original statement (see Table 1). We propose a new approach to the punctuation restoration task, Comprehensive Punctuation Restoration, where the goal is to restore all the original punctuation in the text (i.e., without any reduction) in a token classification manner.
In the following work, we explore the possibility of generating a manageable-sized set of labels directly from the dataset, based on the percentile of punctuation cases covered by the set. We measure the recall gained by using broader class sets and the potential cost in terms of precision. In addition, we test whether models trained on more narrowly defined classes suffer (or gain) on a reduced, conventionally defined 4-class task.
We conducted our research on a parallel corpus of Polish and English, two languages with very different levels of morphological complexity (Łockiewicz and Jaskulska, 2017). With this approach, we can directly compare a set of semantically identical and volumetrically very similar datasets and see how well our results generalize. This allows us to detect whether a trend in our results is specific to a single language.
In summary, in this paper we make the following contributions:

• We propose an approach to generate a comprehensive punctuation label set directly from the dataset rather than from a predefined set of marks.

• We evaluate how increasing the size of the generated label set affects the ability to restore complete punctuation.

• We investigate whether using a large, narrowly defined set of labels affects the performance of the model on the frequently used basic set of 4 classes: PERIOD, COMMA, QUESTION, and OTHER.

Table 1. Type of restoration and resulting text:
4 classes: our way is to honor every religion and every nation according to their paths, as it is written in the book of prophets because every nation will go in the name of its lord.
4 classes + mapping: our way is to honor every religion and every nation according to their paths, as it is written in the book of prophets, because every nation will go in the name of its lord.
full restoration: our way is to honor every religion and every nation according to their paths, as it is written in the book of prophets: 'because every nation will go in the name of its lord.'
Also, to the best of our knowledge, ours is the first publicly described research on punctuation restoration for the Polish language. 1

Related Work
The first approach to punctuation restoration (in the sense of restoring punctuation marks) was proposed by Beeferman et al. (1998). They introduced a model based on a Markov chain, designed for restoring commas in the output of ASR systems. In the field of deep learning, the punctuation restoration task is often approached with bidirectional recurrent neural networks, most often LSTM and GRU architectures. Although LSTM networks are often considered better than GRUs in the general case, computational performance aside (Yang et al. (2020), Weiss et al. (2018)), several papers report that GRUs outperformed LSTMs in the punctuation restoration task (Tilk and Alumäe (2016), Hládek et al. (2019)).
In Tilk and Alumäe (2016), the authors explored the possibility of using bidirectional recurrent networks with attention for punctuation restoration on the Estonian language. They provided their code 2 with the publication, and we use it in our research as an example of a recurrent network model.

1 The Polish language is mentioned as part of a multilingual model in Li and Lin (2020); however, the authors did not publish per-language results.
Recently, an interesting approach based on LSTMs was proposed by Li and Lin (2020), who tried to create a single model for restoring punctuation in 43 languages using language-independent BPE tokenization. They also included Polish in the training set; however, the authors did not publish per-language results.
In recent years, large pre-trained models based on the transformer architecture (Vaswani et al., 2017) seem to perform best on a number of NLP tasks, including punctuation restoration. Perhaps the most comprehensive comparison of various transformer encoder models in the task of punctuation restoration is done by Alam et al. (2020), where the authors compared a number of models based on different variants of pre-trained BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2020) encoders as the base of their model. They showed that larger pre-trained models were generally better than smaller ones in punctuation restoration and that, between models of the same size, RoBERTa was generally better than both ALBERT and BERT. They also showed that XLM (cross-lingual model) variants of RoBERTa were slightly worse than English-only ones. The authors published their code, and we also used it in our research.
In Yi et al. (2020), the authors show that punctuation restoration can also benefit from multitask learning (POS tagging being the secondary task in their work). They trained a single BERT-based model with two token classification heads, one for punctuation restoration and one for POS tagging. While the heads had separate weights, the BERT core was shared. They showed that such a form of regularization can help in the punctuation restoration of unseen data.
Our take on the punctuation restoration task is inspired by the work of Omelianchuk et al. (2020), where the authors automatically generated a set of labels from the data to approximate the capabilities of sequence-to-sequence models with token classification. They did it in the context of the grammatical error correction task.

Dataset

To be able to research comparable corpora for different languages, similarly to Vandeghinste et al. (2018), we used a parallel corpus from the Europarl v7 dataset 3. The corpus is extracted from the proceedings of the European Parliament and translated into multiple languages. Specifically, we use the parallel corpus of Polish and English taken from proceedings from 01/2007 to 11/2011. The corpus is made up of 15.27M words (English) and 12.82M words (Polish) divided into sentences, with each sentence on a separate line. As some of the lines are very short and contain, e.g., only a single number, we removed all lines with fewer than 4 words as a preprocessing step. Then we divided the corpus randomly into training, validation, and test collections in the ratio 8/1/1 (line-wise). See Table 2 for information on the size of each collection.
The text preprocessing step consisted only of the normalization of all whitespace characters (including newlines) into a single space. This decision was motivated by the fact that whitespace is mostly connected with formatting rather than punctuation. In the specific case of the dataset we used, newlines were used to separate sentences. However, if a dataset were annotated in a way that made whitespace formatting meaningful (i.e., using newlines or tabulations for paragraph splitting), this step could be skipped and attempts could be made to also reproduce subtle differences in whitespace. After preprocessing, the text was broken into tokens based on the occurrence of any non-alphanumeric character (including whitespace). Each alphanumeric sequence was considered a single token, and each non-alphanumeric sequence following it was considered the label of that token. The set of all unique non-alphanumeric sequences was considered the largest possible set of punctuation labels for this specific dataset.
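The tokenization step described above can be sketched with a single regular expression (a minimal illustration written for this description, not the released experiment code; the function name is ours):

```python
import re

# Unicode-aware "alphanumeric" class (\w minus underscore), so Polish
# letters such as "ż" or "ł" stay inside tokens. Each match yields a
# token (alphanumeric run) and its label (the non-alphanumeric run,
# including whitespace, that follows it).
TOKEN = re.compile(r"([^\W_]+)([\W_]*)", re.UNICODE)

def tokenize_with_labels(text):
    """Return (token, label) pairs: each alphanumeric run is a token and
    the non-alphanumeric sequence following it is that token's label."""
    text = re.sub(r"\s+", " ", text)  # normalize all whitespace to a single space
    return TOKEN.findall(text)
```

For example, `tokenize_with_labels("Hello, world (yes).")` yields `[("Hello", ", "), ("world", " ("), ("yes", ").")]`; the set of unique second elements over the whole corpus is the comprehensive label set. Note that any punctuation before the first token is not attached to anything in this sketch.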
The classes from the label set were then sorted by their frequency of occurrence in the text. Unsurprisingly, in most cases, by far the most represented class was the single whitespace (which made up 88.36% of all labels in the English dataset and 85.14% in the Polish one). Overall, we got 513 unique classes for the English version and 716 classes for the Polish one. In both cases, there was a long tail of underrepresented classes. Such classes consisted mainly of combinations of rare marks (e.g., '"=.') or very long strings of punctuation characters (e.g., ' [.../...] ['). In the case of the English dataset, there were 330 classes with fewer than 5 occurrences and 186 classes with only one occurrence. In the case of Polish, the long tail was even longer, with 486 classes with fewer than 5 occurrences and as many as 287 with only one occurrence.
As stated in the introduction, the goal of this work is to reproduce as much of the original punctuation as possible. Because of that, no class reduction was performed on the test set, and all the original labels were kept there in unchanged form (even if a class had only one occurrence in the test set and no occurrence in the training set). However, training a model on classes that have only a few samples would be impractical. Because of that, the training and validation sets were reduced in a way that maximizes class coverage. To do that, we created a set of label subsets, each containing the minimum number of classes needed to achieve a given percentile of coverage (not counting the single whitespace). The percentiles with the corresponding numbers of classes are presented in Table 3.
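The percentile-based selection can be sketched as follows (a hypothetical helper for illustration; the exact selection code is not reproduced here):

```python
from collections import Counter

def labels_for_percentile(label_counts, percentile):
    """Given a Counter mapping label -> frequency (with the single
    whitespace label already excluded), return the smallest prefix of the
    frequency-sorted labels whose cumulative frequency reaches the given
    percentile of all punctuation labels."""
    total = sum(label_counts.values())
    chosen, covered = [], 0
    for label, count in label_counts.most_common():
        if covered * 100.0 >= percentile * total:
            break  # target coverage reached
        chosen.append(label)
        covered += count
    return chosen
```

With toy counts {". ": 50, ", ": 30, "? ": 15, "! ": 5}, a 80th-percentile set contains only ". " and ", ", while the 95th percentile also pulls in "? "; this mirrors how the 90/95/99 percentile label sets in Table 3 grow.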
As the last step, we created another version of the label set with the number of classes reduced to the commonly used quadruple of labels: COMMA, PERIOD, QUESTION, OTHER. We used this version of the dataset to test whether training models on a larger, more narrowly defined set of labels would have negative effects relative to models trained on a small label set. To map the comprehensive label set into the simple label set, we used inclusion criteria. I.e., if the character '.' was present in the comprehensive label (e.g., '". '), it would be mapped to the PERIOD class. If more than one base class was found in the comprehensive label, the more frequently represented one was chosen (the precedence order was comma, period, question). If no base label was found, the OTHER class was assigned. This variation of the label set will be referred to as "Reduced".

Experimental Setup

Overview
For our experiments, the first architecture we used was the bidirectional GRU with attention model described in Tilk and Alumäe (2016). Originally, the authors of that paper also tested their solution on a corpus derived from Europarl v7 (though not a parallel one). In their case, they used a total of 8 classes (7 punctuation marks plus a class representing no punctuation). This set of labels will be further marked as "Base-Tilk". As for hyperparameters, we used the ones suggested by the authors (a learning rate of 0.02 and a hidden layer size of 256).
The next set of architectures we examined were base-sized transformer models derived from Alam et al. (2020). We used BERT and RoBERTa for our study of the English dataset and BERT for Polish. In Alam et al. (2020), the authors use a standard set of 4 classes: period, comma, question mark, and the 'other' class (which also covers no punctuation). This set of 4 classes will be marked as "Base-Alam". For the pre-trained Polish BERT model, we used the one trained by Kłeczek (2020), hosted in the Hugging Face model repository 4. We used the cased version because the author recommends it over the uncased version. The only changes we made to the original code from Alam et al. (2020) are those allowing us to change the scope of the predicted classes and to incorporate more pre-trained models (Polish ones). For the hyperparameters, we used a learning rate of 10^-5, a batch size of 8, and an augmentation rate of 0.15 with alpha-sub and alpha-del set to 0.4. We trained each model for 10 epochs.
We first trained the described models on the dataset with labels mapped to the original sets of labels (i.e., the sets used in the original implementations, marked as "Base"). Base sets were mapped to comprehensive labels by matching the most frequently represented comprehensive label containing a character from the base set. For example, the base label "!" would be mapped to the comprehensive label "! ". Labels that were not mapped were replaced with the single whitespace label (" "), representing no punctuation.
We then incrementally increased the number of labels in the training and validation sets so that they covered the 90th, 95th, and 99th percentiles of all the original punctuation (see Tables 9 and 10). On those models, we examined how increasing the size of the training label set affects the precision and recall of punctuation restoration on the test set. In each experiment, the test set contained all original labels (i.e., 513 for the English set and 716 for the Polish set).
Finally, we trained the models on the reduced dataset. Those models are used mainly as a baseline to check whether training the models on comprehensive label sets has a positive or negative effect on model performance for the core classes. It is worth noting that the models trained on this set attempt to predict the labels greedily (i.e., the models are trained to predict the label ". " even where the label '". ' was originally present). For this reason, these models will achieve lower average precision on the comprehensive punctuation restoration task.
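The inclusion-criteria reduction described in the Dataset section, with the comma, period, question precedence stated there, can be sketched as (an illustrative re-implementation, not the original scripts):

```python
# Precedence follows corpus frequency: comma first, then period, then question.
REDUCTION = [(",", "COMMA"), (".", "PERIOD"), ("?", "QUESTION")]

def reduce_label(label):
    """Map a comprehensive punctuation label (e.g. '). ') to one of the
    four base classes by character inclusion; labels containing no base
    character fall through to OTHER."""
    for char, base in REDUCTION:
        if char in label:
            return base
    return "OTHER"
```

For instance, the label '). ' reduces to PERIOD, '), ' to COMMA, and a bare opening parenthesis ' (' to OTHER; a label containing both a comma and a question mark reduces to COMMA under the precedence order.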
All the experiments were performed on the following hardware: RTX 2080 Ti, Intel(R) Xeon(R) CPU E5-2650, 503 GB of RAM. The longest single fine-tuning process took 5 hours 43 minutes (BERT on the English dataset).

Token classification metrics
For the comprehensive punctuation restoration task, we used precision (P), recall (R), and F1 computed as micro-averages over all classes excluding the single whitespace class (i.e., the dominant class corresponding to the absence of punctuation). Predictions are marked as correct only if the model predicted the exact class (i.e., predicting '. ' for a token with ground-truth label '). ' is counted as an error). For the task with reduced labels, we used precision, recall, and F1 for the classes COMMA, PERIOD, and QUESTION. We also computed the macro average of those metrics, reported in the TOTAL section.
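The micro-averaged scores, with the whitespace label treated as the negative class, can be sketched as follows (illustrative code, not the evaluation scripts actually used):

```python
def micro_prf(true_labels, pred_labels, no_punct=" "):
    """Micro-averaged precision/recall/F1 over all punctuation classes.
    A prediction counts as a true positive only on an exact label match;
    the single-whitespace label is 'no punctuation' and is excluded."""
    tp = sum(1 for t, p in zip(true_labels, pred_labels)
             if t == p and t != no_punct)
    pred_pos = sum(1 for p in pred_labels if p != no_punct)  # predicted punctuation
    true_pos = sum(1 for t in true_labels if t != no_punct)  # actual punctuation
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / true_pos if true_pos else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Under this strict matching, predicting '. ' where the ground truth is '). ' contributes a false positive and a false negative at once, which is exactly the severity the mRS metric below is meant to soften.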

mRS
Token classification metrics are very strict (i.e., if the true test label was '). ' and the model predicted '. ', it would still count as a full error). Intuitively, if the model predicts a label that has some common part with the true label, it should receive a better score than for a completely wrong prediction. To address this issue, we used a third metric: mean Ruzicka similarity (mRS). Ruzicka similarity (Deza and Deza, 2009) is a weighted version of Jaccard similarity that allows us to work on multisets (e.g., labels like '...'). It takes values in the range [0, 1], where 1 is achieved for a perfect match; higher values mean better results. In our application, this metric is defined as follows:

RS(P_i, T_i) = \frac{\sum_k \min(P_{i,k}, T_{i,k})}{\sum_k \max(P_{i,k}, T_{i,k})}

where P_i and T_i are the predicted and ground-truth labels of the same token, each represented by a vector consisting of the counts of all single-character punctuation marks in that label (excluding whitespace), and k indexes those punctuation marks.

To compute mean RS, we average the RS metric over all labels, skipping the tokens where the ground-truth label is whitespace only (i.e., no punctuation):

mRS = \frac{1}{N} \sum_{i=1}^{N} RS(P_i, T_i)

where P_i is the predicted label for the i-th token, T_i is the ground-truth label for the i-th token, and N is the total number of tokens considered.
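Under the definitions above, mRS can be sketched in a few lines (an illustrative implementation; the function names are ours):

```python
from collections import Counter

def ruzicka(pred, true):
    """Ruzicka similarity between two labels, each viewed as a multiset
    (Counter) of punctuation characters, whitespace excluded."""
    p = Counter(c for c in pred if not c.isspace())
    t = Counter(c for c in true if not c.isspace())
    chars = set(p) | set(t)
    num = sum(min(p[c], t[c]) for c in chars)
    den = sum(max(p[c], t[c]) for c in chars)
    return num / den if den else 1.0  # two empty multisets match perfectly

def mean_rs(pred_labels, true_labels):
    """Average RS over tokens whose ground-truth label contains punctuation."""
    scores = [ruzicka(p, t) for p, t in zip(pred_labels, true_labels)
              if t.strip()]  # skip whitespace-only ground truth
    return sum(scores) / len(scores) if scores else 0.0
```

For example, predicting '. ' where the ground truth is '). ' scores 0.5 instead of the full error it receives under token classification metrics, since the period is shared and only the parenthesis is missed.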

Results for English
Example predictions (on an excerpt from the test set) from the best model for English (RoBERTa) trained on different numbers of training labels are shown in Table 4, whereas the metrics for all experiments are presented in Table 5.
As expected, increasing the number of classes on which the model was trained increases the average recall (R). Depending on the method, the increase over the method's native class list was between 12 and 14 percentage points. As for averaged precision (P), a clear decrease was observed only in the case of the BiGRU model. In the models based on pre-trained BERTs, the highest precision was obtained with an increased number of classes. Since the fluctuation of precision with an increasing label set was relatively small compared to the gain in recall, the F1 metric in each case increased with the number of classes. Also, the less rigid mRS metric showed an average gain of about 14 points when using models trained on the 99th percentile. This number shows how much we would lose if we reduced the punctuation to the base set.

Table 6, on the other hand, presents a comparison of the performance of the models on the reduced set of labels. It can be observed that the number of labels (column L) on which the model was trained did not have a major impact on quality in the basic formulation of the problem. The only clear decrease can be seen in the model trained at 90% label coverage. This model was unable to restore question marks because its training set, whose labels were formed from the first labels sorted by frequency of occurrence, did not include any class containing a question mark. The lack of a decrease in performance on this task shows that current deep models are capacious enough that increasing the range of labels (and thus both the resolution and the range of predicted punctuation marks) does not carry a cost in terms of model quality on the prediction of the more salient marks.

Results for Polish
The results for the Polish language are presented in Table 7. In general, Polish turned out to be a slightly easier punctuation restoration task as a whole. The best F1 score obtained for Polish was 85.93, compared to 84.29 for English. In the case of Polish, the effect of adding subsequent classes on increasing recall was smaller (although still relatively large). In the BERT model, adding more classes strictly decreased the average prediction precision, but in the recurrent model (BiGRU), no clear trend was observed. This is somewhat opposite to the results obtained on the English set.

Similarly to English, we also tested whether models trained on the larger label set would lose performance on the baseline task of 4 classes. The results of the models on this task are presented in Table 8. There was no noticeable effect of increasing the number of labels on the quality of the model in predicting basic labels. For Polish, we found that the task of restoring commas was easier, while that of restoring question marks was much more difficult. We suspect that this might be rooted in the structure of the language because, in Polish, one can often come across question structures that differ from the indicative sentence only by the question mark at the end; e.g., it is common to use structures like "jesteś szczęśliwy?" ("you are happy?") rather than "czy jesteś szczęśliwy?" (which would resemble "are you happy?"). However, to make a definite statement, it would be necessary to conduct further research in this area, especially since the basis of the BERT methods is a language model, which for obvious reasons was pre-trained on different sets for each of the two languages.

Conclusion and Future Work
In our work, we have shown that token classifier models are able to restore a much larger range of punctuation than is done in most other reported research. Our experiments show that such an increase in coverage can be achieved without a drop in quality for key punctuation marks. We have also shown that this effect is not limited to English: we obtained very similar results in Polish, a language with much more complex morphology. Additionally, an advantage of the approach with automatic generation of a set of labels from the data is that we are also able to predict combinations of punctuation marks. In further work, it would be of great benefit to investigate what effect reproducing a wide range of punctuation has on text readability for people, compared to reproducing only the basic characters. It would also be interesting to perform a comparative study of how token classifier models perform in the task of reproducing broad punctuation compared to sequence-to-sequence models, such as BART (Lewis et al., 2020) or T5 (Raffel et al., 2020), for which such behavior would be natural. We also plan to take part in the PolEval 2021 5 shared task, concerning punctuation restoration from read text in Polish. In contrast to the problem analyzed here, the data sets will contain acoustic information, e.g., information that could allow determining the duration of gaps between words.