A Simple Geometric Method for Cross-Lingual Linguistic Transformations with Pre-trained Autoencoders

Powerful sentence encoders trained for multiple languages are on the rise. These systems are capable of embedding a wide range of linguistic properties into vector representations. While explicit probing tasks can be used to verify the presence of specific linguistic properties, it is unclear whether the vector representations can be manipulated to indirectly steer such properties. For efficient learning, we investigate the use of a geometric mapping in embedding space to transform linguistic properties, without any tuning of the pre-trained sentence encoder or decoder. We validate our approach on three linguistic properties using a pre-trained multilingual autoencoder and analyze the results in both monolingual and cross-lingual settings.


Introduction
Recently, the design of sentence encoders, both monolingual (Kiros et al., 2015; Conneau et al., 2017) and multilingual (Artetxe and Schwenk, 2019; Feng et al., 2020), has enjoyed a lot of attention. Many works have used probing tasks to investigate the presence of specific linguistic properties in sentence representations (Adi et al., 2016; Conneau and Kiela, 2018; Ravishankar et al., 2019; Hewitt and Manning, 2019; Chi et al., 2020). However, it remains unclear to what extent these linguistic properties can actually be steered by manipulating the representations. By analogy to the definition of style transfer from Li et al. (2018), we refer to modifying a particular linguistic property in a given text (e.g., a sentence's tense) while preserving all of the property-independent content as linguistic property transfer.
Training dedicated models to transfer linguistic properties requires substantial computational effort and a lot of training data. Adding the ability to transform a new property may require an entire retraining of the text encoder and decoder. This is especially challenging for low-resource languages or when reusing or building transfer models for more than one language.
Assuming that pre-trained autoencoders capture the linguistic properties of interest, we investigate (i) whether they can be used without further tuning to efficiently transfer the properties, and (ii) whether this extends to the cross-lingual setting, when based on a multilingual pre-trained autoencoder. Our starting point is a pre-trained sentence encoder, with a corresponding decoder trained on an autoencoder objective. We show how a geometric transformation of pre-trained multilingual sentence embeddings can be efficiently learned on CPU for transferring specific linguistic properties. We also experiment with cross-lingual linguistic property transfer, using a language-agnostic pretrained encoder.
In summary, this paper presents a set of preliminary experiments on linguistic property transfer, and shows that there may be value in further research on manipulating distributed representations to efficiently tackle language generation tasks.

Related work
Linguistic properties usually denote the grammatical behavior of linguistic units in sentences. This contrasts with styles which concern semantic aspects of sentences such as sentiment and gender. Nevertheless, transferring linguistic properties can be situated in the broader style transfer setting.
Style transfer systems can be categorized into (i) methods that learn disentangled representations, in which the content is explicitly separated from the style, making the style aspect controllable and interpretable (Hu et al., 2017; Shen et al., 2017; Fu et al., 2018; Logeswaran et al., 2018; John et al., 2019), and (ii) methods that learn entangled representations, in which the content and style are not explicitly separated (Mueller et al., 2017; Dai et al., 2019; Liu et al., 2020; Duan et al., 2020). Our approach falls under the entangled methods, because encoder-decoder systems trained on an autoencoding objective yield representations in which there is no explicit separation between content and style. Conceptually, our method is most similar to Duan et al. (2020), but differs in (i) that it can use any existing pre-trained autoencoder as opposed to training an autoencoder from scratch on a variational objective, (ii) that a simple geometric transformation is applied to the representations instead of training a computationally heavy neural transformation network, and (iii) that it generalizes to the cross-lingual setting.

Figure 1: (a) Pre-trained autoencoder (encoder ENC, decoder DEC). (b) Linguistic property classifier C. (c) Geometric transformation of the sentence representation to shift z according to λ beyond the decision boundary of C; the shifted encoding z' is then given as input to the decoder, resulting in the sentence x' with the transferred property.

Linguistic Property Transfer
Our system consists of three components: (1) a pretrained multilingual autoencoder, (2) linear classifiers for the targeted linguistic properties and (3) a component that geometrically transforms sentence embeddings to transfer the selected properties in the dense sentence representation space. These components are presented schematically in Fig. 1.
We start from a pre-trained autoencoder (Fig. 1a) that consists of an encoder ENC : X → R^n, which maps sentences x ∈ X to vectors z ∈ R^n, and a decoder DEC : R^n → X, which maps the vectors z back to the corresponding sentences.
The second component (Fig. 1b) is a linear classifier C : R^n → Y that takes as input a sentence encoding z and outputs a linguistic property label. We limit our experiments to binary properties, i.e., Y = {0, 1}.
Finally, the last component (Fig. 1c) performs a geometric transformation. It allows flipping the value of the selected linguistic property by projecting the original encoding z into the opposite half-space with respect to the property classifier, over an estimated distance λ. This leads to the transferred encoding z', designed to be decoded into a sentence x' close to the original sentence, but with the transformed target property.
The three components shown in Fig. 1 are further described below.

Pretrained Autoencoder
For the pre-trained autoencoder shown in Fig. 1a, we use Language Agnostic Sentence Representations (LASER) (Artetxe and Schwenk, 2019). LASER encodes sentences of 93 languages into a single vector space, such that semantically similar sentences in different languages have similar vectors. For our experiments, we leave the LASER encoder unchanged and train separate decoders for English and Dutch, by optimizing the likelihood p(x|z), with z = ENC(x). The decoder consists of a single-layer 1024-dimensional hidden state LSTM (Hochreiter and Schmidhuber, 1997).

Linear Property Classifier
Our approach assumes that both labels of the considered property are linearly separable in z space. A linear classifier C is trained on examples of the linguistic property. With coefficients w ∈ R^n and bias b ∈ R, its decision boundary is characterized by the affine hyperplane

H = {z ∈ R^n : w · z + b = 0}.    (1)

Logistic regression was used for the results presented in this work.
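As a concrete sketch of this component, the snippet below fits such a linear classifier and reads off w and b. The synthetic encodings, the reduced dimensionality (n = 8 instead of LASER's 1024), and the plain gradient-descent fit are illustrative assumptions, not the paper's setup; the paper only states that logistic regression was used.

```python
import numpy as np

# Toy stand-in for 1024-dim LASER encodings (n = 8 for brevity): two
# linearly separable clusters labelled 0/1 for a binary property.
rng = np.random.default_rng(0)
n = 8
Z = np.vstack([rng.normal(+1.0, 0.5, (100, n)),   # property label 1
               rng.normal(-1.0, 0.5, (100, n))])  # property label 0
y = np.array([1] * 100 + [0] * 100)

# Minimal logistic regression fitted by gradient descent.
w, b = np.zeros(n), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))        # sigma(w . z + b)
    w -= 0.5 * (Z.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

# The decision boundary is the affine hyperplane H = {z : w . z + b = 0}.
accuracy = float(((Z @ w + b > 0).astype(int) == y).mean())
assert accuracy > 0.95
```

The fitted (w, b) pair is all the geometric transformation in the next section needs.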

Geometric Transformation
The idea behind the geometric transformation is the following: a perpendicular projection from z onto the decision plane H would make the classifier C maximally uncertain about the considered property, with minimal changes (in the Euclidean sense) to the original vector. When removing the property information from the corresponding sentence with the opposite label, we assume it gets projected onto the same position on H. As a result, the proposed geometric transformation comes down to shifting z in the direction perpendicular to H, and beyond it, into the region where C would predict the opposite label of the property. The transformed representation z' is then decoded by DEC. The intuitive approach of simply mirroring z over the decision plane appears sub-optimal (see Section 4.4). The distance into the opposite half-space is therefore predicted based on the input (see Section 3.4).
The geometric shift of z in the direction of H can be derived with basic geometry; what follows is a brief sketch. By construction, w is perpendicular to the plane described by z · w = 0, which in turn is parallel to H, given Eq. (1), such that w ⊥ H. With that, the perpendicular projection z⊥ of z onto H can be written as

z⊥ = z − ((w · z + b) / ||w||²) w,

after substituting z⊥ ∈ H into Eq. (1). We finally express the transformation of z onto z' beyond H as

z' = z + (1 + λ)(z⊥ − z),    (2)

where the parameter λ ≥ 0 represents the distance of z' from H, relative to the distance ||z⊥ − z|| on the original side of the decision plane (indicated as d⊥ in Fig. 1).
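The projection and shift amount to a few lines of NumPy. In the sketch below, the classifier parameters and the encoding are random stand-ins for an actual trained classifier and a LASER embedding:

```python
import numpy as np

def project_onto_plane(z, w, b):
    """Perpendicular projection of z onto H = {z : w . z + b = 0}."""
    return z - ((w @ z + b) / (w @ w)) * w

def shift_encoding(z, w, b, lam):
    """Eq. (2): z' = z + (1 + lam) * (z_perp - z), with lam >= 0."""
    z_perp = project_onto_plane(z, w, b)
    return z + (1.0 + lam) * (z_perp - z)

rng = np.random.default_rng(0)
n = 8
w, b = rng.normal(size=n), 0.3   # stand-in classifier parameters
z = rng.normal(size=n)           # stand-in sentence encoding

z_perp = project_onto_plane(z, w, b)
assert abs(w @ z_perp + b) < 1e-9          # z_perp lies on H

# lam = 1 mirrors z over the plane: z' ends up exactly as far beyond H
# as z was in front of it (the baseline discussed in Section 4.4).
z_prime = shift_encoding(z, w, b, lam=1.0)
assert np.isclose(np.linalg.norm(z - z_perp), np.linalg.norm(z_prime - z_perp))

# The classifier's score flips sign, so C predicts the opposite label.
assert np.sign(w @ z + b) == -np.sign(w @ z_prime + b)
```

Note that for any λ > 0 the shifted score is −λ(w · z + b), so the label flips regardless of λ; λ only controls how deep into the opposite half-space z' lands.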

Projection Distance Predictor
As mentioned above, we propose estimating the most suitable value of λ, corresponding to how far beyond the decision plane z needs to be projected to obtain optimal transfer results. To that end, we use a contextual multi-armed bandit (CMAB) (Auer, 2002), a simple and efficient form of reinforcement learning with a single state, which in our setting is the sentence representation z. For a new input z, the bandit needs to select the value of λ that best allows transferring the property with Eq. (2), while preserving the content of the associated sentence x. The bandit method allows using a non-differentiable reward, but other choices of algorithm are possible.

Our model's goal is to preserve the content of the original sentence x while changing its property y to y'. Hence, our CMAB reward consists of (i) a linguistic property reward r_prop and (ii) a content-preserving reward r_content. To compute r_prop, we pass the decoded transformed sentence x' = DEC(z') back into the encoder and use the predicted likelihood of the corresponding linear property classifier for the target y' as the reward:

r_prop(x', y') = σ(w · ENC(x') + b) if y' = 1, and 1 − σ(w · ENC(x') + b) otherwise,

with σ(.) the logistic function. For r_content, we directly optimize the BLEU-score (Papineni et al., 2002) between the original sentence x and the transferred sentence x'. Intuitively, this leads to the minimum number of changes that are required to transfer y to y' and thus encourages content preservation between x and x':

r_content(x, x') = BLEU(x, x').

For the final reward r(x, x', y'), the harmonic mean of r_prop(x', y') and r_content(x, x') appeared a suitable choice, encouraging the model to jointly ensure the correct target property (high r_prop) as well as preserve the sentence content (high r_content).
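The reward computation can be sketched as follows. To keep the example self-contained, the BLEU-score is replaced by a simple unigram-precision stand-in (the actual system uses BLEU as in Papineni et al., 2002), and the classifier score is passed in directly rather than re-encoding a decoded sentence:

```python
import math

def r_prop(score, target_label):
    """Property reward: the classifier's likelihood of the target label,
    where `score` stands for w . ENC(x') + b on the re-encoded output."""
    p1 = 1.0 / (1.0 + math.exp(-score))      # sigma(score) = p(y' = 1 | z')
    return p1 if target_label == 1 else 1.0 - p1

def r_content(src_tokens, out_tokens):
    """Content reward. The paper uses BLEU(x, x'); as a self-contained
    stand-in we use unigram precision against the source tokens."""
    if not out_tokens:
        return 0.0
    src = set(src_tokens)
    return sum(t in src for t in out_tokens) / len(out_tokens)

def reward(score, target_label, src_tokens, out_tokens):
    """Final reward: harmonic mean of the two partial rewards."""
    rp = r_prop(score, target_label)
    rc = r_content(src_tokens, out_tokens)
    return 0.0 if rp + rc == 0.0 else 2.0 * rp * rc / (rp + rc)

src = "i ask many people here .".split()
out = "i asked many people here .".split()
r = reward(score=3.0, target_label=1, src_tokens=src, out_tokens=out)
assert 0.8 < r < 0.95   # high on both components, so their harmonic mean is high
```

The harmonic mean is near zero as soon as either component is near zero, which is exactly the intended behavior: a fluent copy with the wrong property, or a correctly transformed but garbled sentence, both earn little reward.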
We implement the CMAB using the LinUCB with Disjoint Linear Models algorithm from Li et al. (2010), which assumes that the expected reward obtained from choosing arm λ is linear in the input features (in our case, the sentence encoding z). For each discrete allowed value ('arm') of λ, LinUCB learns a separate ridge-regression model, with learnable parameters A ∈ R^{n×n} and b ∈ R^n (n = 1024 for LASER). It predicts the reward, including an upper confidence bound (UCB), for choosing that value of λ for the given encoding z. The hyperparameter α controls the width of the UCB: a larger α results in a wider UCB. Each training iteration observes a single z, for which the arm achieving the highest potential reward (UCB) is chosen, and only the parameters of its ridge-regression model are updated. Quantifying the merit of each arm for the input requires one (n × n) matrix inversion, 2 matrix-vector multiplications, and 2 dot products. The best arm's parameters (A and b) are then updated, requiring one outer product. During inference, the λ value of the best arm is used. The training and inference schemes are presented in Algorithms 1 and 2.
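A minimal sketch of LinUCB with disjoint linear models is given below. The class and variable names are ours, and the tiny dimensionality, arm set, α, and hand-picked context are for illustration only (the paper uses n = 1024, arms {1, 1.5, ..., 7}, and α = 4):

```python
import numpy as np

class LinUCBDisjoint:
    """LinUCB with disjoint linear models (Li et al., 2010): one
    ridge-regression model (A, b) per arm, i.e. per candidate lambda."""

    def __init__(self, arms, n, alpha):
        self.arms = list(arms)                     # candidate lambda values
        self.alpha = alpha                         # width of the confidence bound
        self.A = [np.eye(n) for _ in self.arms]    # A_a initialised to I_n
        self.b = [np.zeros(n) for _ in self.arms]  # b_a initialised to 0

    def select(self, z):
        """Return the index of the arm with the highest UCB for context z."""
        ucbs = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)               # the (n x n) matrix inversion
            theta = A_inv @ b                      # ridge-regression estimate
            # Predicted reward plus exploration bonus: 2 matrix-vector
            # multiplications and 2 dot products per arm.
            ucbs.append(theta @ z + self.alpha * np.sqrt(z @ A_inv @ z))
        return int(np.argmax(ucbs))

    def update(self, arm, z, reward):
        """Update only the chosen arm's model (one outer product)."""
        self.A[arm] += np.outer(z, z)
        self.b[arm] += reward * z

bandit = LinUCBDisjoint(arms=[1.0, 1.5, 2.0], n=4, alpha=1.0)
z = np.array([1.0, 0.0, 0.0, 0.0])   # a fixed toy "sentence encoding"

# With no observations yet, every arm has theta = 0 and an identical
# exploration bonus, so argmax falls back to the first arm.
assert bandit.select(z) == 0

# Observe a reward for arm 0 on this context and update its model.
bandit.update(0, z, reward=1.0)
# Arm 0's estimate becomes theta = (I + z z^T)^(-1) z = z / 2, so its UCB on
# z is 0.5 + alpha / sqrt(2) ~ 1.21, still above the other arms' 1.0.
assert bandit.select(z) == 0
```

Because each arm keeps its own (A, b), adding or removing a candidate λ never touches the other arms' statistics, which keeps the per-iteration cost exactly as described above.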

Experiments
To investigate whether linguistic properties embedded in representations of pre-trained encoders can be transferred without finetuning, we first apply the SentEval tool from Conneau and Kiela (2018) to LASER-embeddings (Section 3.1) and identify three properties that have a strong presence. We then investigate how well our approach performs on these properties in the monolingual setting (ML), in which our CMAB model is both trained and evaluated on English sentences (Q1). Finally, we investigate the performance of our approach in the cross-lingual setting (CL), in which the model is trained on English but evaluated on Dutch sentences. In particular, after training on English, Dutch sentences are passed into the LASER encoder to obtain the transformed encodings z', which in turn are decoded by the Dutch decoder (Q2).

Table 1: Results of LASER-embeddings on the probing tasks of Conneau and Kiela (2018). In our experiments, we transfer the properties denoted in bold.

Table 1 shows the results of LASER-embeddings on the probing tasks from Conneau and Kiela (2018). The high accuracies for the properties shown in bold indicate that LASER encodes them well. In our experiments, we transfer (i) Tense, whether the main verb is in the present or the past tense, (ii) ObjNum, the number (singular or plural) of the direct object of the main clause, and (iii) SubjNum, the number (singular or plural) of the subject of the main clause.

Implementation and Training Data
As discussed in Section 3.1, we use LASER's encoder and train two decoders on it, with around 20M English and Dutch OpenSubtitles sentences (Tiedemann, 2012; Lison et al., 2019). For each property, we train a binary logistic regression model on CPU using SentEval data, through stratified 5-fold cross-validation. We found that training the CMAB models on SentEval led to worse results than training on OpenSubtitles. We hypothesize that this is due to a mismatch between the SentEval sentences and the OpenSubtitles sentences on which the decoders were trained. We therefore trained the CMAB models, on CPU, using 2500 English OpenSubtitles sentences with (noisy) property labels predicted by the SentEval classifiers. Across all experiments, we use the discrete set {1, 1.5, ..., 7} as possible values for λ (the 'arms' of the CMAB algorithm) and set the CMAB exploration parameter α to 4.

Evaluation
We randomly selected OpenSubtitles sentences (not seen during decoder training), and for those with any of the target properties present, annotated the corresponding sentence with the flipped property. As such, 100 test pairs (x, x') were obtained for each property. We report human evaluation metrics: (i) the percentage of transferred sentences that have the correct property ('Label' accuracy), and (ii) the percentage of transferred sentences that have the correct property and preserve the content ('All' accuracy). We also include the BLEU-score between the transferred sentence and the gold target x'.

Table 2: Human label accuracy ('Label'), accuracy of both label and content ('All'), and BLEU-scores of our CMAB-approach (monolingual and cross-lingual).

Results
To answer (Q1), we refer to the first three rows of Table 2. Our approach switches properties in roughly half of the cases (label accuracy). However, there are fewer cases in which both the property is transferred and the content is preserved. The last three rows of Table 2 display the metrics in the cross-lingual setting, where we notice results similar to those of the monolingual setting (Q2). The results are encouraging, although we expect further improvements from more complex transformation approaches. Table 3 shows, for Tense ML, a comparison of our CMAB approach against a baseline that mirrors each z over the decision boundary, i.e., λ = 1. We find that the CMAB-approach outperforms this baseline on all metrics. Moreover, Table 4 shows the distribution of the predicted arms on the test sets in the monolingual and cross-lingual settings, indicating that the optimal value for λ is input-dependent. As an illustration, Table 5 lists a few examples, picked randomly from among those test items with successful label transformation and content preservation.

Conclusion and Future Work
We have introduced a simple and efficient geometric method to transfer linguistic properties, which has been evaluated on three properties in both monolingual and cross-lingual settings. While there is room for improvement, our preliminary results indicate that it can allow pre-trained autoencoders to transfer linguistic properties without additional tuning, such that there is no need to train dedicated transfer systems. This potentially makes learning faster and more scalable than with existing methods. For future work, we aim at extending our method to transformer-based encoders (monolingual and cross-lingual), and will consider additional linguistic as well as more style-oriented properties.

Table 4: Distributions of the predicted projection distances of the CMAB for the different test sets, expressed as a percentage (monolingual and cross-lingual).

Tense (present → past)
Monolingual: "i ask many people here ." → "i asked many people here ."
Cross-lingual: "ik kijk naar een oude film van m ' n moeder ." → "ik bekeek een oude film van mijn moeder ." (I watch an old film of my mother. → I watched an old film of my mother.)

ObjNum (singular → plural)
Monolingual: "i could tell you some story ." → "i could tell you some stories ."

SubjNum (plural → singular)
Monolingual: "families agreed to keep it quiet ." → "a family agreed to keep it quiet ."
Cross-lingual: "monsters gaan ons opeten ." → "het monster gaat ons opeten ." (Monsters are going to eat us. → The monster is going to eat us.)

Table 5: Linguistic property transfer examples of the proposed system in both monolingual and cross-lingual settings.