Comparing Feature-Engineering and Feature-Learning Approaches for Multilingual Translationese Classification

Traditional hand-crafted linguistically-informed features have often been used for distinguishing between translated and original non-translated texts. By contrast, to date, neural architectures without manual feature engineering have been less explored for this task. In this work, we (i) compare the traditional feature-engineering-based approach to the feature-learning-based one and (ii) analyse the neural architectures in order to investigate how well the hand-crafted features explain the variance in the neural models’ predictions. We use pre-trained neural word embeddings, as well as several end-to-end neural architectures in both monolingual and multilingual settings and compare them to feature-engineering-based SVM classifiers. We show that (i) neural architectures outperform other approaches by more than 20 accuracy points, with the BERT-based model performing the best in both the monolingual and multilingual settings; (ii) while many individual hand-crafted translationese features correlate with neural model predictions, feature importance analysis shows that the most important features for neural and classical architectures differ; and (iii) our multilingual experiments provide empirical evidence for translationese universals across languages.


Introduction
Texts originally written in a language exhibit properties that distinguish them from texts that are the result of a translation into the same language. These properties are referred to as translationese (Gellerstam, 1986). Earlier studies have shown that using various hand-crafted features for supervised learning can be effective for translationese classification (Baroni and Bernardini, 2005; Rubino et al., 2016). However, this approach has a number of limitations. Firstly, manually designed features may be partial and non-exhaustive in the sense that they are based on our linguistic intuitions, and thus are not guaranteed to capture all discriminative characteristics of the input data seen during training. Other limitations are related to the difficulty of obtaining linguistic annotation tools (e.g., parsers, taggers, etc.) for some languages, reliance on n-gram counts, limited contexts, and corpus-specific characteristics, among others. In this work, we compare a standard approach based on hand-crafted features with automatic feature learning based on data, task and learner, without prior linguistic assumptions.
Moreover, most previous approaches have focused on classifying translationese in the monolingual setting, i.e. translations come from one or multiple source languages, but the language on which to perform the classification is always the same. To the best of our knowledge, the multilingual setting with multiple source and target languages has not been explored yet. If translationese features are language-independent or shared among languages, multilingual translationese classification experiments would show the effect. We perform binary translationese classification not only in mono-, but also in multilingual settings to empirically verify the existence of translationese universals throughout different source and target languages.
In our work we investigate: (i) How automatic neural feature learning approaches to translationese classification compare to classical feature-engineering-based approaches on the same data. To do this, we use pre-trained embeddings as well as several end-to-end neural architectures.
(ii) Whether it is possible to effectively detect translationese in multilingual multi-source data, and how it compares to detecting translation in monolingual and single-source data in different languages.
(iii) Whether a) translationese features learned in one setting can be useful in a different setting and b) the overhead of training separate monolingual models can be reduced by either multi-source monolingual models for a given target language or even better, a multilingual model. For this we perform cross-data evaluation.
(iv) Whether variation observed in predictions of neural models can be explained by linguistically inspired hand-crafted features. We perform linear regression experiments to study the correlation between hand-crafted features and predictions of representation learning models as a starting point for investigating neural models which do not lend themselves easily to full explainability.
We show that:
• representation-learning approaches outperform hand-crafted feature-selection methods for translationese classification, with BERT giving the highest accuracy,
• it is possible to classify translationese in multilingual data, but models trained on monolingual single-source data generally yield better performance than models trained on multi-source and multilingual data,
• in contrast to hand-crafted-feature-based models, neural models perform relatively well on different datasets (cross-data evaluation), and single-source models can, to a reasonable extent, be substituted by multi-source mono- and multilingual models,
• many traditional hand-crafted translationese features exhibit significant correlation with the predictions of the neural models; however, a feature importance analysis shows that the most important features for neural networks and for classical architectures differ.
The paper is organized as follows. Section 2 describes related work. Section 3 introduces the architectures used in our study. Section 4 discusses the data and presents the main classification results. We perform cross-data evaluations in Section 5 and analyze feature importance and correlation in Section 6. Finally, we summarize and draw conclusions in Section 7.
By contrast, to date, neural approaches to translationese (Bjerva et al., 2019) have received less attention. While Bjerva et al. (2019) have used learned language representations to show that the distance in the representational space reflects language phylogeny, Dutta Chowdhury et al. (2020, 2021) use divergence from isomorphism between embedding spaces to reconstruct phylogenetic trees from translationese data. Sominsky and Wintner (2019) train a BiLSTM for translation direction identification and report accuracy up to 81.0% on Europarl data.

Hand-crafted Features + SVM (Handcr.+SVM)

We employ the INFODENS toolkit (Taie et al., 2018) to extract hand-crafted features to train and evaluate a classifier. We use a support vector machine (SVM) classifier with a linear kernel, and fit the hyperparameter C on the validation set. For the choice of features, we replicate the setup from Amponsah-Kaakyire et al. (2021), using a 108-dimensional feature vector inspired by the feature set described in Rubino et al. (2016). In particular, we use:
1. surface features: average word length, syllable ratio, paragraph length. These surface features can be connected to the simplification hypothesis (Ilisei et al., 2010), as it is assumed that translations contain simpler, shorter words than original texts.
2. lexical features: lexical density, type-token ratio. These lexical features can also be linked to the simplification hypothesis, due to the assumption that original texts have a richer vocabulary than translated ones and contain a higher proportion of content words (Laviosa, 1998; Baker et al., 1993).
3. unigram bag-of-PoS: these features correspond to the source interference (shining-through) hypothesis, as POS n-grams reflect grammatical structure, which might be altered in translations due to the influence of the source language grammar.
4. language modelling features: log probabilities and perplexities with and without considering the end-of-sentence token, according to forward and backward n-gram language models (n ∈ [1; 5]) built on tokens and POS tags. It is hypothesized that the perplexity of translated texts may be increased because of simplification, explicitation and interference (Rubino et al., 2016).
5. n-gram frequency distribution features: percentages of n-grams in the paragraph occurring in each quartile (n ∈ [1; 5]). This feature could be linked to the normalization hypothesis, according to which translated texts are expected to contain more collocations, i.e. high-frequency n-grams (Toury, 1980;Kenny, 2001).
In our experiments, language models and n-gram frequency distributions are built on the training set. The n-gram language models are estimated with SRILM (Stolcke, 2002), and SpaCy (https://spacy.io/) is used for POS tagging. Features are scaled by their maximum absolute values. The full list of 108 features is given in Appendix A.1.
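As an illustration, the n-gram frequency distribution features (group 5) can be sketched for unigrams as follows. The quartile bucketing and the handling of unseen tokens are our assumptions for the sketch, not the toolkit's exact implementation:

```python
from collections import Counter

import numpy as np

def quartile_features(train_tokens, paragraph):
    """Fraction of a paragraph's tokens falling into each frequency
    quartile of the training-set unigram distribution."""
    freq = Counter(train_tokens)
    counts = np.array(sorted(freq.values()))
    q1, q2, q3 = np.percentile(counts, [25, 50, 75])
    feats = [0.0, 0.0, 0.0, 0.0]
    for tok in paragraph:
        f = freq.get(tok, 0)  # unseen tokens fall into the lowest bucket
        idx = 0 if f <= q1 else 1 if f <= q2 else 2 if f <= q3 else 3
        feats[idx] += 1.0
    return [x / len(paragraph) for x in feats]
```

The four resulting percentages per n-gram order are concatenated into the final feature vector before max-absolute scaling.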

Embedding-based Classification
Average pre-trained embeddings + SVM (Wiki+SVM)

We compute an average of all token vectors in the paragraph, and use this mean vector as a feature vector to train a SVM classifier with a linear kernel. We work with the publicly available language-specific 300-dimensional pre-trained Wiki word vector models trained on Wikipedia using fastText (Joulin et al., 2016).
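A minimal sketch of the mean-vector representation; the fastText loading step is omitted, and `vectors` is assumed to be a token-to-vector mapping:

```python
import numpy as np

def mean_vector(tokens, vectors, dim=300):
    """Average the embeddings of all tokens found in the model;
    paragraphs with no known tokens get a zero vector."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```

The resulting per-paragraph vectors are stacked into a matrix and passed to a linear-kernel SVM, e.g. scikit-learn's `SVC(kernel="linear")`.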
Gaussian representation + SVM (Wiki+Gauss.+SVM)

We follow Gourru et al. (2020) and represent a text as a multivariate Gaussian distribution based on the distributed representations of its words. We perform similarity-based classification with SVMs, where the kernel represents similarities between pairs of texts. We work with the same pre-trained Wikipedia embeddings as in Wiki+SVM for the words in the model, and initialize the ones not contained in the model to random vectors. Specifically, the method assumes that each word w is a sample drawn from a Gaussian distribution with mean vector µ and covariance matrix σ². A text is then characterized by the average of its words and their covariance. The similarity between texts i and j is a convex combination of the similarities of their mean vectors µ_i and µ_j and their covariance matrices σ²_i and σ²_j:

K(i, j) = α · sim(µ_i, µ_j) + (1 − α) · sim(σ²_i, σ²_j),  (2)

where α ∈ [0, 1], and the similarities between the mean vectors and the covariance matrices are computed using cosine similarity and element-wise product, respectively. Finally, a SVM classifier is employed using the kernel matrices of Equation 2 to perform the classification.
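The per-text statistics and the kernel can be sketched as follows; the diagonal-covariance representation and the summed element-wise product for the covariance similarity are our reading of the description above, not a verified reimplementation:

```python
import numpy as np

def gaussian_repr(word_vecs):
    """Represent a text by the mean and (diagonal) variance of its word vectors."""
    W = np.asarray(word_vecs, dtype=float)
    return W.mean(axis=0), W.var(axis=0)

def text_kernel(mu_i, var_i, mu_j, var_j, alpha=0.5):
    """Convex combination of mean-vector cosine similarity and an
    element-wise covariance similarity (summed product)."""
    cos = mu_i @ mu_j / (np.linalg.norm(mu_i) * np.linalg.norm(mu_j))
    return alpha * cos + (1.0 - alpha) * float((var_i * var_j).sum())
```

The pairwise kernel matrix built from `text_kernel` can then be fed to scikit-learn's `SVC(kernel="precomputed")`.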

fastText classifier (FT)
fastText (Joulin et al., 2016) is an efficient neural network model with a single hidden layer. The fastText model represents texts as a bag of words and a bag of n-gram tokens. Embeddings are averaged to form the final feature vector. A linear transformation is applied before a hierarchical softmax function to calculate the class probabilities. Word vectors are trained from scratch on our data.

Pre-trained embeddings + FT (Wiki+FT)
In this model we work with the pre-trained word vectors from Wikipedia to initialize the fastText classifier. The data setting makes this directly comparable to Wiki+SVM, a non-neural classifier.

Long short-term memory network (LSTM)
We use a single-layer uni-directional LSTM (Hochreiter and Schmidhuber, 1997) with embedding and hidden layers of 128 dimensions. The embedding layer uses wordpiece subunits and is randomly initialised. We pool (average) all hidden states and pass the output to a binary linear classifier. We use a batch size of 32, a learning rate of 1·10^−2, and the Adam optimiser with PyTorch defaults.
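A sketch of this classifier in PyTorch; the wordpiece tokenisation and the training loop are omitted, and the vocabulary size is a placeholder:

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)          # randomly initialised
        self.lstm = nn.LSTM(dim, dim, batch_first=True)   # single layer, uni-directional
        self.cls = nn.Linear(dim, 2)                      # binary linear classifier

    def forward(self, ids):
        h, _ = self.lstm(self.emb(ids))   # (batch, seq_len, dim) hidden states
        return self.cls(h.mean(dim=1))    # average-pool all hidden states

model = LSTMClassifier(vocab_size=30000)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-2)
logits = model(torch.randint(0, 30000, (32, 80)))  # one batch of 80-token paragraphs
```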

Simplified transformer (Simpl.Trf.)
We use a single-layer encoder-decoder transformer with the same hyperparameters and wordpiece embedding layer as the LSTM. The architecture has no positional encodings. Instead, we introduce a simple cumulative sum-based contextualisation.
The attention computation has been simplified to element-wise operations and there are no feedforward connections. A detailed description is provided in Appendix A.2.

Bidirectional Encoder Representations from Transformers (BERT)
We use the BERT-base multilingual uncased model (12 layers, 768 hidden dimensions, 12 attention heads) (Devlin et al., 2019). Fine-tuning is done with the simpletransformers library. For this, the representation of the [CLS] token goes through a pooler, where it is linearly projected and a tanh activation is applied. Afterwards it undergoes dropout with probability 0.1 and is fed into a binary linear classifier. We use a batch size of 32, a learning rate of 4·10^−5, and the Adam optimiser with epsilon 1·10^−8. Models were fine-tuned on 4 GPUs.
We design and compare our "lean" single-layer LSTM and simplified transformer models with BERT in order to investigate whether the amount of data and the complexity of the task necessitate complex and large networks.


Translationese Classification

Data
We use the monolingual and multilingual translationese corpora from Amponsah-Kaakyire et al. (2021), which contain annotated paragraphs (avg. 80 tokens) of the proceedings of the European Parliament, the Multilingual Parallel Direct Europarl (MPDE). Annotations indicate the source (SRC) and target (TRG) languages, the "original" or "translationese" label, and whether the translations are direct or undefined (possibly translated through a pivot language). As texts translated through a pivot language may have different characteristics from directly translated texts, here we only use the direct translations. For the initial experiments we focus on 3 languages: German (DE), English (EN) and Spanish (ES). We adopt the following format for data description: we refer to translationese corpora (i.e. corpora where half of the data is original, half translationese) with the "TRG-SRC" notation (with a dash), where TRG is the language of the corpus and SRC is the source language from which the translation into TRG was done in order to produce the translationese half. The "TRG←SRC" notation (with an arrow) denotes the result of translating a text from SRC into TRG; we use it to refer only to the translated half of the data. For all settings we perform binary classification: original vs. translated.

Results
Paragraph-level translationese classification results, with means and standard deviations over 5 runs, are reported in Table 2. Overall, the BERT model outperforms the other architectures in all settings, followed closely by the other end-to-end neural architectures. Using the pre-trained Wiki embeddings helps improve the accuracy of the fastText method in all cases. Among the approaches with the SVM classifier, Wiki+SVM performs best in the single-source settings, but shows lower accuracy than Handcr.+SVM in the multi-source (TRG-ALL) settings. Wiki+Gauss.+SVM performs worst, apart from on ES-EN and DE-ALL.
In the monolingual single-source settings, we observe that accuracy is slightly lower when the source language is typologically closer to the text language, i.e. it becomes more difficult to detect translationese. Specifically, DE-EN tends to have lower accuracy than DE-ES; EN-DE lower accuracy than EN-ES; and ES-EN lower accuracy than ES-DE. Accuracy generally drops when going from single-source to the multi-source setting, e.g. from DE-EN and DE-ES to DE-ALL. The EN-ALL dataset is the most difficult for most of the models among the TRG-ALL datasets. The ALL-ALL[3] setting exhibits comparable accuracy to the TRG-ALL setting for the neural models, but for the SVM there is a drop of around 9 points. Throughout our discussion we always report absolute differences between systems. The ALL-ALL[8] data results in reduced accuracy for most architectures, except Handcr.+SVM.
Neural-classifier-based models substantially outperform the other architectures: the SVMs trained with hand-crafted linguistically inspired features, for example, trail BERT by ∼20 accuracy points.
To make sure our hand-crafted-feature-based SVM results are competitive, we compare them with a strong baseline from previous work on our data, which showed that training a SVM classifier on the top 1000 most frequent POS- or character-trigrams yields SOTA translationese classification results on Europarl data. On our data, POS-trigrams yield around a 5-point increase in accuracy for most of the datasets, and character-trigrams tend to lower the accuracy by around 4 points (Appendix A.3). For the remainder of the paper we continue to work with our hand-crafted features, designed to capture various linguistic aspects of translationese.

Multilinguality and Cross-Language Performance
Since the neural architectures perform better than the non-neural ones, we perform the multilingual and cross-language analysis only with the neural models. We evaluate the models trained on one dataset on the other ones, in order to verify:
• whether, for a given target language, a model trained to detect translationese from one source language can detect translationese from another source language: TRG-SRC1 on TRG-SRC2, and TRG-SRC on TRG-ALL;
• how well a model trained to detect translationese from multiple source languages can detect translationese from a single source language: TRG-ALL on TRG-SRC, and ALL-ALL[3] on TRG-SRC;
• how well a model trained to detect translationese in multilingual data performs on monolingual data: ALL-ALL[3] on TRG-ALL, and ALL-ALL[3] on TRG-SRC.
Table 3 shows the results of cross-data testing for the monolingual models for the best-performing architecture (BERT). For the single-source monolingual models, we observe a relatively smaller drop (up to 13 percentage points) in performance when testing TRG-SRC on TRG-ALL (as compared to testing TRG-SRC on TRG-SRC), and a larger drop (up to 27 points) when testing TRG-SRC1 on TRG-SRC2 (as compared to testing TRG-SRC1 on TRG-SRC1). The fact that classification performance stays above 64% confirms the hypothesis that translationese features are source-language-independent.
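The cross-data testing described above amounts to filling an accuracy grid over (training set, test set) pairs; a minimal sketch, where the model and dataset names are placeholders:

```python
def cross_evaluate(models, test_sets):
    """models: name -> predict(x); test_sets: name -> list of (x, gold) pairs.
    Returns accuracy for every (model, test set) combination."""
    return {
        (m, d): sum(int(predict(x) == y) for x, y in data) / len(data)
        for m, predict in models.items()
        for d, data in test_sets.items()
    }
```

The diagonal of this grid corresponds to in-domain testing; off-diagonal cells measure the cross-data drops discussed in this section.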
Another trend that can be observed is that, in cross-testing TRG-SRC1 and TRG-SRC2, the model whose source language is more distant from the target suffers a larger performance drop when tested on the test set with the more closely related source language than the other way around. For instance, the DE-ES model tested on the DE-EN data suffers a decrease of 17.8 points, while the DE-EN model tested on the DE-ES data suffers a decrease of 9.8 points. This may be because the DE-EN model has learned more of the general translationese features, which helps it obtain higher accuracy on data with a different source, while the DE-ES model may have learned to rely more on language-pair-specific features, and therefore gives lower accuracy on data with a different source. A similar observation has been made by Koppel and Ordan (2011).
For the multi-source monolingual models (TRG-ALL), testing on the TRG-SRC1 and TRG-SRC2 datasets shows a slight increase in performance for a source language that is more distant from the target, and a slight decrease for the more closely related source language (as compared to testing TRG-ALL on TRG-ALL). We also compare the performance of ALL-ALL[3] on different test sets to the original performance of the models trained on these datasets (in parentheses). There is a relatively larger drop in accuracy for the TRG-SRC data than for the TRG-ALL data. The largest drop for the neural models is 6.7 accuracy points, whilst the smallest performance drop for Handcr.+SVM is 12.7. This highlights the ability of the neural models to learn features in a multilingual setting which generalize well to their component languages, whereas the Handcr.+SVM method does not seem to work well in such a case. However, for the ALL-ALL[8] models, Table 5 shows a large performance drop across all architectures as compared to the results from the models specifically trained for the task. The dedicated models are trained on language-specific features, whereas the ALL-ALL[8] model is trained on more diverse data containing typologically distant languages and thus captures less targeted translationese signals.
In summary, we observe that: • For a given target language, even though a neural model trained on one source language can decently identify translationese from another source language, the decrease in performance is substantial.
• Neural models trained on multiple sources for a given target language perform reasonably well on single-source languages.
• Neural models trained on multilingual data ALL-ALL[3] perform reasonably well on monolingual data, especially on multi-source monolingual data.

Figure 1: Top 10 SVM features, ranked by the absolute value of their feature weights.
• Using more source and target languages (ALL-ALL[8]) leads to a larger decrease in cross-testing accuracy.

Feature Importance and Relation to Neural Models
In this section we aim to quantify the importance of the hand-crafted linguistically inspired features used in Handcr.+SVM according to different multilingual models (ALL-ALL[3] setting). As we use a Support Vector Machine with a linear kernel, we can interpret the magnitude of the feature weights as a feature importance measure; Guyon et al. (2002), for instance, use the squared SVM weights for feature selection. We rank the features by the absolute value of the weight. The feature ranks are listed in Appendix A.1. Figure 1 shows the top 10 features. Paragraph length is the most relevant feature, and we observe that most of the top features correspond to paragraph log probability. These features characterize simplification in translationese.
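Ranking features by the magnitude of the linear-SVM weights can be sketched as follows; the feature names are illustrative:

```python
import numpy as np

def rank_by_weight(weights, names):
    """Sort feature names by descending absolute SVM weight."""
    order = np.argsort(-np.abs(np.asarray(weights)))
    return [names[i] for i in order]
```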
To explore whether there is any correlation between the hand-crafted features and the predictions of the trained neural models, we conduct the following experiment in the multilingual setting. We fit a linear regression model for each hand-crafted feature, using the estimated probabilities of the neural model as gold labels to be predicted. More formally, with n paragraphs (p_i, i = 1...n) in the test set and d features, for each feature vector x_j ∈ R^n, j = 1...d, we fit a model

ŷ = w_j x_j + b_j,

where w_j, b_j ∈ R are the model parameters, and y ∈ R^n is the vector of predictions of the neural model F (LSTM, Simplified Transformer, BERT) on the test set, with each dimension y_i giving the probability that a data point belongs to the translationese class. We apply min-max normalization to the features. We find that a large proportion of the linguistically motivated features are statistically significant for predicting the neural models' predicted probabilities: 60 features (out of 108) are significant for the LSTM, 38 for the Simplified Transformer, and 56 for BERT, each at the 99.9% confidence level. We also fit the per-feature linear models to predict the actual gold labels (and not the predictions of the neural models) to investigate which features correlate with the ground-truth classes, and find 55 features to be statistically significant at the 99.9% confidence level. The full list of statistically significant features for each model, as well as for the gold labels, is given in Appendix A.1. We observe that the features significant for the neural models largely overlap with the features significant for the gold labels: the F1-score (as a measure of overlap) is 0.89 for the LSTM, 0.75 for the Simplified Transformer and 0.99 for BERT. This is expected, because high-performing neural models output probabilities that are generally close to the gold labels, and therefore a similar correlation with the hand-crafted features occurs.
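Each per-feature regression has a closed-form solution; a sketch of the fit and its R²:

```python
import numpy as np

def fit_per_feature(x, y):
    """Simple linear regression y ≈ w*x + b for one feature; returns (w, b, R²)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # slope from covariance/variance
    b = y.mean() - w * x.mean()                     # intercept
    residuals = y - (w * x + b)
    r2 = 1.0 - residuals.var() / y.var()            # explained-variance ratio
    return w, b, r2
```

Here `x` is one min-max-normalized feature column and `y` is the neural model's predicted translationese probability per paragraph.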
The R² measure is further used to rank features based on the amount of explained variance in the predictions of a model. The top 10 features for predicting the predictions of each neural model and for predicting the actual gold labels are displayed in Figure 2. The order of the top features is similar across the neural models (pairwise rank correlations ρ_Spearman of at least 0.76), and similar to, but not identical to, the gold-label results (pairwise rank correlations ρ_Spearman of at least 0.75). We observe that most of the top features are either POS-perplexity-based or bag-of-POS features. These features characterize interference in translationese. It also appears that more importance is attached to perplexities based on unigrams and bigrams than on other n-grams. Notably, the order of feature importance for the neural models is highly dissimilar from the order of hand-crafted feature weights for the SVM (pairwise rank correlations ρ_Spearman of at most 0.23). This might be connected to the accuracy gap between these models.
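The rank correlations above can be computed with a small Spearman ρ sketch; ties are not handled here, which suffices for strict rankings:

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation for tie-free score lists."""
    ra = np.argsort(np.argsort(a)).astype(float)  # ranks of a
    rb = np.argsort(np.argsort(b)).astype(float)  # ranks of b
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))
```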
We conclude that many of the hand-crafted translationese features are statistically significant for predicting the predictions of the neural models (and the actual gold labels). However, due to the low R² values, we cannot conclude that the hand-crafted features explain the features learnt by the representation-learning models.

Summary and Conclusion
This paper presents a systematic comparison of the performance of feature-engineering-based and feature-learning-based models on binary translationese classification tasks in various settings, i.e., monolingual single-source data, monolingual multi-source data, and multilingual multi-source data. Additionally, we analyze the neural architectures to see how well the hand-crafted features explain the variance in the predictions of the neural models. The results obtained in our experiments show that (i) representation-learning-based approaches outperform hand-crafted linguistically inspired feature-selection methods for translationese classification on a wide range of tasks, (ii) the features learned by feature-learning-based methods generalise better to different multilingual tasks, and (iii) our multilingual experiments provide empirical support for the existence of language-independent translationese features. We also examine multiple neural architectures and confirm that translationese classification requires deep neural models for optimum results. We have shown that many traditional hand-crafted translationese features significantly predict the output of representation-learning models, but may not necessarily explain their performance, due to the weak correlation. Our experiments also show that even though single-source monolingual models yield the best performance, they can, to a reasonable extent, be substituted by multi-source mono- and multilingual models.
Our interpretability experiment provides only some initial insight into the neural models' performance. Even though there are significant relationships between many of the features and the neural models' predicted probabilities, further experiments are required to verify that the neural models actually use something akin to these features. Moreover, our current approach ignores interactions between the features. In the future, we plan to conduct a more detailed analysis of the neural models' decision making.

A Appendix
A.1 List of hand-crafted features Col. 3: SVM feature importance ranks (ranked by absolute feature weight) for the model trained on the ALL-ALL[3] set.
Col. 4-7: Statistical significance of the features as predictors in per-feature linear regression with respect to neural models' predicted probabilities and gold labels (1 -significant with 99.9% confidence level) on the ALL-ALL[3] test set.

A.2 Simplified Transformer
The simplified transformer differs from the standard transformer in the following ways: 1. A cumulative sum-based contextualisation layer is used instead of positional encodings.
2. The attention computation is reduced to element-wise operations and has no feedforward connections.

A.2.1 Encoder
The encoder consists of a contextualisation layer and an attention layer, with residual connections around both sublayers, followed by layer normalisation.

A.2.2 Contextualisation
Given an input sequence S = s_1, s_2, ..., s_L, we obtain an embedding matrix X ∈ R^{D×L}. The embedding matrix X is fed into the contextualisation layer of the transformer to obtain contextual embeddings X̃ ∈ R^{D×L}. We begin by taking a cumulative sum over the sequence of embeddings X as the context matrix C, with columns

C_{:j} = Σ_{k=1}^{j} X_{:k},

followed by a column-wise dot product between the embedding matrix X and the generated context C to get weights w ∈ R^{1×L}:

w_j = X_{:j} · C_{:j},

where j is the position of a word in the sequence, w is a row vector and w_j is a scalar at position j. The original embedding matrix X is then multiplied element-wise with the weights w to obtain a contextualised representation of the sequence:

X̃_{:j} = w_j X_{:j}.
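The three steps above can be written compactly in NumPy; this is a sketch of the contextualisation only:

```python
import numpy as np

def contextualise(X):
    """X: (D, L) embedding matrix -> contextualised embeddings of the same shape."""
    C = np.cumsum(X, axis=1)           # context: prefix sums over positions
    w = np.einsum("dl,dl->l", X, C)    # column-wise dot products, shape (L,)
    return X * w                       # scale each column by its weight
```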

A.2.3 Attention
The attention layer takes 3 inputs: query, key and value. The output of the contextualisation layer (X̃) is fed in as both the query and the value. The context matrix (C) is fed in as the key. The query and key are passed through a feature map to obtain Q and K, respectively. The feature map ensures that Q and K are always positive. Therefore, we can simplify the softmax to its first term: the energy simply scaled by the sum of the energies.

A.2.4 Decoder
The decoder consists of two blocks. The first block is similar to the encoder block, with a contextualisation layer and an attention layer with residual connections between both sublayers, followed by layer normalisation. The second block is another attention layer with a residual connection to the previous block, followed by layer normalisation. In the second block, the output of the first block is fed in as the query and the output of the encoder block is fed in as the value. The key is the sum of the columns of the encoder's embedding matrix (X). This sum operation can be skipped by taking the last column of the encoder's context matrix (C). The decoder output (Y ∈ R^{D×L}) is pooled (average) and the resulting D×1 vector is passed to a classifier.

Table 7: Test accuracy of baseline systems implemented from previous work. The mean and the standard deviations over 5 runs are reported. The difference from the Handcr.+SVM model is indicated in parentheses.