PunKtuator: A Multilingual Punctuation Restoration System for Spoken and Written Text

Text transcripts without punctuation or sentence boundaries are hard to comprehend for both humans and machines. Punctuation marks play a vital role in giving meaning to a sentence, and their incorrect use or placement can often alter it. This can impact downstream tasks such as machine translation, language understanding, pronoun resolution, and text summarization, for both humans and machines. An automated punctuation restoration (APR) system with minimal human intervention can improve the comprehension of text and help users write better. In this paper we describe a multitask modeling approach as a system to restore punctuation in multiple high-resource languages (Germanic: English and German; Romance: French) and low-resource languages (Indo-Aryan: Hindi; Dravidian: Tamil) that does not require extensive knowledge of the grammar or syntax of a given language, for both spoken and written text. For German and the given Indic languages, this is the first work towards restoring punctuation and can serve as a baseline for future research.


Introduction
Automatic speech recognition (ASR) has become ubiquitous these days and has wide applications in business and personal life. One drawback of ASR is that it produces an unpunctuated stream of text, and restoring punctuation manually is a time-consuming task. Apart from spoken text, a large amount of written text online (blogs, articles, social media, etc.) sometimes lacks appropriate punctuation marks due to human inconsistencies, which can alter the meaning of the text. An APR system designed with an understanding of both ASR output and written text can help resolve these issues. Transcriptions passed through an APR system can improve downstream machine learning tasks such as machine translation, conversational agents, and coreference resolution. Further, punctuation restoration can be used as an unsupervised auxiliary or pretext task for training large-scale transformer language models, as it requires an understanding of the global structure of the text. Prior punctuation restoration methods have mostly relied on lexical features, prosodic features, or a combination of both. Due to the large availability of text data, the majority of methods have focused on lexical features. Early methods (Christensen et al., 2001) used Hidden Markov Models (HMMs) to model punctuation using acoustic features such as pause duration, pitch, and intensity. Though acoustic models perform well on ASR output, they perform better when combined with textual data. Liu et al. (2006), Batista et al. (2007), and Kolář and Lamel (2012) proposed methods that combined lexical features with prosodic information, thereby improving APR performance. Alumäe (2015, 2016) proposed unidirectional and bidirectional Long Short-Term Memory (Bi-LSTM) punctuation prediction models that did not require extensive feature engineering. Though these methods captured long-distance token dependencies, they ignored label dependencies.
To address label dependencies, Klejch et al. (2017) used recurrent neural networks for sequence-to-sequence mapping with an encoder-decoder architecture. Recently, transformer-based approaches combining speech features and pre-trained word embeddings have achieved state-of-the-art performance on the IWSLT datasets (spoken transcripts from TED talks for ASR tasks, often used as a benchmark for comparing punctuation restoration models). Yi et al. (2020) used pretrained BERT (Devlin et al., 2018) to perform adversarial multi-task learning for punctuation restoration. Alam et al. (2020) used an augmentation strategy to make models more robust to ASR errors. Though most approaches have shown considerable improvement in modeling and have achieved state-of-the-art performance on spoken English transcripts, the following limitations remain:
• Restoring punctuation differs between spoken and written text due to differences in the rules of writing and speaking. The frequent use of personal pronouns, colloquial words, and direct speech often results in more varied use of punctuation in spoken text compared to written text. This often affects readability for humans and machines.
• Though some research (Tilk and Alumäe, 2016; Kolář and Lamel, 2012; Alam et al., 2020) has focused on developing non-English APR systems, extensive research and baseline results have not been established for other languages.
To overcome some of these challenges, we make the following contributions:
• We implemented a multi-task multilingual punctuation restoration model. Our technique frames punctuation restoration as a sequence labeling task, jointly trained with language classifiers and a text mode classifier ('Spoken' vs. 'Written'). We use the proposed technique to build two multilingual models, for high-resource and low-resource languages, thereby reducing the dependency on multiple monolingual language models.
• We developed a web browser extension that can help multilingual users punctuate spoken and written transcripts as a post-processing step. A demo of the web extension is available online.
• We prepared training and test datasets and evaluated the performance of our proposed model. Further, to evaluate the generalization of the model, we evaluated it on the benchmark IWSLT reference dataset. The code and models have been made publicly available.

Punctuation restoration system

Data Gathering
Because the available data varies by language, we segregated the data sources by language, gathering both spoken and written text. For written text we considered data from online news sources.

High Resource Languages
For high-resource European languages, we considered a parallel sentence corpus, the EUROPARL corpus (Vanmassenhove and Hardmeier, 2018), for spoken text. This corpus is a collection of speeches made in the proceedings of the European Parliament from 1996 to 2012, transcribed as text.
To gather written text, we used news articles from Alexa's top-25 ranked news sources, which were publicly available for every language.

Low Resource Languages
Due to the lack of language resources available for Indic languages for APR, we gathered publicly released datasets. For spoken text we used the Indian Prime Minister's addresses to the nation; these corpora were manually translated into several Indian languages. Written text was obtained from Siripragada et al. (2020), who crawled articles released by the Press Information Bureau (PIB), an Indian government agency that provides information to news media.

Annotation
Due to the lack of readily available annotated datasets and large corpora, we used an automated approach to label the data. We analyzed the languages and selected the three most common punctuation marks occurring across them ('PERIOD', 'COMMA', and 'QUESTION MARK') for training our model. This was done to improve the readability of text so that it could be easily understood by users, one of the goals of the system. Since we treat our task as a sequence labeling task, we annotated every word in the sequence according to the punctuation mark following it. We achieved this by tokenizing the input text into a stream of word tokens and punctuation tokens, then converting it into a set of (token, punctuation) pairs, where the punctuation is the null label ('O') if no punctuation mark follows the token in the text. To make our dataset more diverse and training more robust, we ended 10% of sentences a few tokens before the 'PERIOD' tag and labeled the final token as 'EOS' (end of sentence). Further, we converted all text to lowercase to remove the casing signal while training the language model. The distribution of the labels can be seen in Table 1.
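The labeling step above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the function and label names are our own, and real tokenization would use the model's subword tokenizer rather than whitespace splitting.

```python
# Hypothetical sketch of the automated (token, punctuation-label) annotation.
# Label names follow the paper; everything else is illustrative.
PUNCT2LABEL = {".": "PERIOD", ",": "COMMA", "?": "QUESTION"}

def annotate(text):
    """Convert raw text into (token, label) pairs, where each word is
    labeled with the punctuation mark that follows it ('O' if none)."""
    pairs = []
    for tok in text.lower().split():  # lowercase to remove the casing signal
        label = "O"
        # Strip a single trailing punctuation mark, if present.
        if tok and tok[-1] in PUNCT2LABEL:
            label = PUNCT2LABEL[tok[-1]]
            tok = tok[:-1]
        if tok:
            pairs.append((tok, label))
    return pairs

print(annotate("Where are you going? I am going home, to rest."))
# → [('where', 'O'), ('are', 'O'), ('you', 'O'), ('going', 'QUESTION'),
#    ('i', 'O'), ('am', 'O'), ('going', 'O'), ('home', 'COMMA'),
#    ('to', 'O'), ('rest', 'PERIOD')]
```

A production version would also emit the 'EOS' label for the truncated sentences described above.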

Joint Multilingual Model-Architecture
The model consists of four main sub-parts, as shown in Figure 2. A Bi-LSTM model is used to better capture token dependencies from the forward and backward directions. The NCRF (Yang and Zhang, 2018) learns high-level features from the deep neural network and passes this information to a linear-chain CRF layer for inference, which helps manage label dependencies. This sequential architecture is trained for the APR task. The output sequence representation from the Bi-LSTM is passed through a max-pooling layer, the result of which is fed to linear feed-forward layers for language and text mode classification. We jointly train the sequential language model along with the classifiers.

Web Extension
We created a web extension that can be used to punctuate text within text editors on web pages. It lets users select text to punctuate, ranging from a few words to large paragraphs to entire documents. The text does not have to be unpunctuated, as the system removes punctuation as a preprocessing step and punctuates it again. The steps to punctuate are shown in Figure 1.

Experiments

Experimental setup
We used the pretrained transformer models and their specific tokenizers available on HuggingFace. The model architecture consists of a 12-hidden-layer encoder, which is used to produce the embeddings. We used an optimized weighting technique (Peters et al., 2018) to sum all the hidden layers, rather than the common practice of using a single layer to generate embeddings. This showed an improvement in performance, as seen in the ablation studies in Table 5. The weighting method is defined as:

O = γ Σ_{j=1}^{L} H_j S_j

where
• H_j is a trainable task weight for the j-th layer, and γ is another trainable task parameter that aids the optimization process.
• S_j is the normalized embedding output from the j-th hidden layer of the transformer.
• O is the output vector.
• L is the number of hidden layers.
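The layer-weighting scheme can be illustrated with the following sketch, assuming the Peters et al. (2018) formulation in which the per-layer weights are softmax-normalized before mixing. Plain Python lists stand in for tensors, and the function name is our own; in practice this runs over the transformer's hidden-state tensors.

```python
import math

def scalar_mix(layer_outputs, weights, gamma):
    """Weighted sum of per-layer embeddings (illustrative only).
    layer_outputs: list of L vectors, one per hidden layer (S_j)
    weights: list of L trainable scalars (H_j), softmax-normalized here
    gamma: trainable global scale
    """
    exps = [math.exp(w) for w in weights]
    total = sum(exps)
    norm = [e / total for e in exps]  # softmax over the layer weights
    dim = len(layer_outputs[0])
    mixed = [0.0] * dim
    for w, layer in zip(norm, layer_outputs):
        for i in range(dim):
            mixed[i] += w * layer[i]
    return [gamma * x for x in mixed]

# With equal weights, the mix reduces to gamma times the mean of the layers.
layers = [[1.0, 2.0], [3.0, 4.0]]
print(scalar_mix(layers, [0.0, 0.0], gamma=1.0))  # → [2.0, 3.0]
```

In training, the weights and γ are learned jointly with the rest of the model, letting the task choose which encoder layers to emphasize.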
To train the proposed model, we used a maximum sequence length of 505. We use a subword tokenization technique, the SentencePiece model (Kudo and Richardson, 2018), which can result in the token length exceeding the maximum sequence length; in such cases we exclude the overflowing tokens and start a new paragraph. For sequences shorter than the specified maximum length, we pad the sequences to the maximum sequence length and mask the padding so that no attention is performed on it. We used a batch size of 32, grouping sequences of similar length prior to padding, which speeds up training. We do not fine-tune the transformer model, but use it to embed the input text. A Bi-LSTM stacked on top of the transformer model is set to a dimension of 512, and its layers are initialized with a uniform distribution in the range (-0.003, 0.003). A neural CRF layer is trained with a maximum log-likelihood loss, and the Viterbi algorithm is used to search for the label sequence with the highest probability during decoding. The entire model was trained with the Adam optimization algorithm with a learning rate close to 1e-4 over 10 epochs.

The proposed multitask network was trained via a dynamically weighted averaging (DWA) technique to balance the tasks, so that no one task dominates or negatively impacts the performance of the others. This approach was proposed for training a multi-task computer vision network (Liu et al., 2019); we followed a similar approach and applied it to a language processing task, showing an overall improvement in performance. Unlike GradNorm, which learns to balance tasks over time using network gradients, the DWA method uses only the numeric task losses. The weighting λ_j for task j is defined as:

λ_j(t) = K exp(w_j(t-1) / T) / Σ_k exp(w_k(t-1) / T),  with  w_j(t-1) = L_j(t-1) / L_j(t-2)

where L_j is the loss of task j, so w_j is the ratio of the loss over the last two epochs. T is the temperature, which controls the softness of the task weighting. A higher value of T gives a more even distribution across the tasks; when T is high enough, λ_j approaches 1. K is the total number of tasks being trained. The overall loss is the sum of the individual task losses averaged over each iteration:

L = L_pr + L_lc + L_tm

where L_pr is the maximum-likelihood loss for punctuation restoration, L_lc is the cross-entropy loss for language classification, and L_tm is the cross-entropy loss for text mode classification.
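The DWA weighting can be sketched as follows, assuming the formulation of Liu et al. (2019): a task whose loss is shrinking more slowly (higher loss ratio over the last two epochs) receives a larger weight. The function name and default temperature are illustrative.

```python
import math

def dwa_weights(prev_losses, prev_prev_losses, T=2.0):
    """Dynamically weighted averaging (Liu et al., 2019), sketch.
    Returns lambda_j = K * softmax(w_j / T), where
    w_j = L_j(t-1) / L_j(t-2) is each task's loss ratio."""
    K = len(prev_losses)
    w = [a / b for a, b in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(x / T) for x in w]
    s = sum(exps)
    return [K * e / s for e in exps]

# The task whose loss dropped more slowly (task 0) gets a weight above 1,
# while the faster-improving tasks fall below 1.
lam = dwa_weights([0.9, 0.5, 0.5], [1.0, 1.0, 1.0])
print(lam)

# When all tasks improve at the same rate, every weight equals 1.
print(dwa_weights([0.5, 0.5, 0.5], [1.0, 1.0, 1.0]))
```

Note that the weights always sum to K, so the overall loss scale is preserved while the balance among tasks shifts.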

Results
To evaluate the performance of our joint model, we built several multilingual neural models. We split our dataset into three parts: a train set (80%), a validation set (10%), and a test set (10%). Each model was trained on the train set and evaluated on the test set, using F1-score as the evaluation metric. We established a baseline using a BILSTM-CRF with pretrained FastText word embeddings (Bojanowski et al., 2017) as features, trained jointly on the language and text mode classification tasks. The FastText word embeddings used as features are monolingual; to train multilingual models, we derived cross-lingual embeddings by aligning the monolingual embeddings of different languages in a single space using unsupervised techniques (Chen and Cardie, 2018). The parameters and training setup of the baseline were similar to those of the proposed model, except that FastText word embeddings were used as input features. Further, we make comparisons using MBERT and XLM-RoBERTa as pretrained models. Table 2 shows the performance of the various models on high-resource European languages along with their F1 scores. To ensure a fairer comparison, we ran the trained model of Alam et al. (2020), which achieved state-of-the-art performance on the IWSLT datasets, on our test set. The Joint-Multilingual BERT NCRF proposed in Section 2.3 outperforms the other models across spoken and written text for all punctuation marks. We observe that German performs best across spoken and written text, which can be attributed to a couple of reasons. In German, multiple words can be condensed into a single word; this reduces ambiguity and leaves fewer decision points for the model to infer. German is also an inflected language, i.e., word forms change according to their function in the sentence.
Word order is therefore comparatively flexible, but in dependent clauses the verb must come at the end of the sentence (SOV order). This may provide an additional signal to the model and impact its performance.
Models                              | F1-Score
DRNN-LWMA-pre (Kim, 2019)           | 68.6
Self-Attention (Yi and Tao, 2019)   | 72.9
BERT-Adversarial (Yi et al., 2020)  | 77.8
Joint M-BERT (Our Model)            | 80.3
XLM-R Augmented (Alam et al., 2020) | 82.9

To assess the ability of our model to generalize, we evaluated our best-performing model on the reference transcripts of the IWSLT dataset. Even though our model was not trained on these datasets, it was able to outperform some of the prior state-of-the-art models, as shown in Table 4. The metrics shown are average F1-scores. The performance of our proposed models on the low-resource languages, for both spoken and written transcripts, can be observed in Table 3. We obtained the best results with the Joint-Multilingual BERT NCRF model. For low-resource languages the performance on QUESTION is lower than on COMMA and PERIOD, due to the lower number of questions in the true label set. We experimented with different ablations of the best-performing model, as seen in Table 5.

Ablation Studies
• BILSTM-NCRF -We do not consider any embeddings and train a simple BILSTM-NCRF model.
• MBERT-NCRF -We removed the BILSTM layer and use only NCRF layer on top of transformers.
• MBERT-BILSTM -We remove the NCRF layer and model only the token dependencies.
• Without weighted layers -We removed the trainable weighing parameters and considered only the top layer of the transformer as input to the BILSTM.
• Without classification layers -We removed the classification layers and trained the model without any auxillary information.

Conclusion
In this paper we described and implemented a joint modeling approach for restoring punctuation in high- and low-resource languages across spoken and written text. The joint language model, trained with auxiliary language and text mode classification, improved the performance of the APR task. We achieved reasonable performance on the benchmark IWSLT datasets without training on them.
We also presented a web extension that can help multilingual users improve the overall readability and coherence of text. Further, we presented baseline results on Indic languages that can be used for future work. Examples of punctuated text output by our system are shown in the Appendix.
Input: japan then laid siege to the syrian penalty area for most of the game but rarely breached the syrian defence oleg shatskiku made sure of the win in injury time hitting an unstoppable left foot shot from just outside the area
Output: Japan then laid siege to the syrian penalty area for most of the game, but rarely breached the syrian defence. Oleg shatskiku made sure of the win in injury time, hitting an unstoppable left foot shot from just outside the area.

Input: russia's refusal to support emergency supply cuts would effectively and fatally undermine OPEC+'s ability to play the role of oil price stabilizing swing producer says Rapidan Energy's Bob McNally
Output: Russia's refusal to support emergency supply cuts would effectively and fatally undermine OPEC+'s ability to play the role of oil price stabilizing. Swing producer, says Rapidan Energy's Bob McNally.

Input: Romeo Romeo wherefore art thou Romeo
Output: Romeo, Romeo, wherefore art thou Romeo?

Input: sans préjuger de l'efficacité de ce couvre-feu avancé ces données ne sont toutefois pas si facilement lisibles selon les experts suivant l'épidémie de Covid-19 Tout d'abord on manque encore de recul
Output: Sans préjuger de l'efficacité de ce couvre-feu avancé, ces données ne sont toutefois pas si facilement lisibles, selon les experts, suivant l'épidémie de Covid-19. Tout d'abord, on manque encore de recul.