Visualizing the Relationship Between Encoded Linguistic Information and Task Performance

Probing is a popular technique for analyzing whether linguistic information is captured by a well-trained deep neural model, but it is hard to answer how changes in the encoded linguistic information affect task performance. To this end, we study the dynamic relationship between the encoded linguistic information and task performance from the viewpoint of Pareto Optimality. The key idea is to obtain a set of models that are Pareto-optimal in terms of both objectives. From this viewpoint, we propose a method to obtain the Pareto-optimal models by formalizing the goal as a multi-objective optimization problem. We conduct experiments on two popular NLP tasks, i.e., machine translation and language modeling, and investigate the relationship between several kinds of linguistic information and task performance. Experimental results demonstrate that the proposed method is better than a baseline method. Our empirical findings suggest that while some syntactic information is helpful for NLP tasks, encoding more syntactic information does not necessarily lead to better performance, because the model architecture is also an important factor.


Introduction
Recent years have witnessed the great success of deep neural networks for natural language processing tasks, such as language modeling (Zaremba et al., 2014; Merity et al., 2018) and neural machine translation (Bahdanau et al., 2015; Vaswani et al., 2017). The excellent task performance they achieve has sparked interest in interpreting their underlying mechanisms. Since linguistic knowledge is crucial in natural languages, an emerging body of literature uses probes (Conneau et al., 2018; Alt et al., 2020; Saleh et al., 2020; Cao et al., 2021) to investigate whether a standard model trained towards better task performance also captures linguistic information. From the perspective of information theory, Voita and Titov (2020) and Pimentel et al. (2020b) show that probes can be used to estimate the amount of linguistic information captured by a fixed model. However, such probing only extracts linguistic information from a fixed standard model, which helps little in understanding the relationship between task performance and the linguistic information encoded by the model. For example, under their methodology, it is difficult to answer the following two questions: first, would adding linguistic information be beneficial for an NLP model; second, is it harmful when this linguistic information is reduced? Therefore, it remains an open and intriguing question how task performance changes with respect to different amounts of linguistic information.

(* Equal contribution. Work done while J. Xiang was an intern at Tencent AI Lab.)
To this end, this paper proposes a novel viewpoint for studying the relationship between task performance and the amount of linguistic information, inspired by the criterion of Pareto Optimality, which is widely used in economics (Greenwald and Stiglitz, 1986). Our main idea is to obtain Pareto-optimal models on a test set in terms of both linguistic information and task performance, and then visualize their relationship along these optimal models. By comparing a standard model with these optimal models, we can clearly answer whether adding encoded information helps improve task performance over the standard model, as illustrated in Figure 1, where the points on the line are Pareto-optimal and the red triangle denotes the standard model with the best performance.
Nevertheless, it is typically intractable to obtain the Pareto-optimal models along both dimensions on test data. To address this challenge, we propose a principled method to approximately optimize the Pareto-optimal models on the training data, which can be expected to generalize well to test sets according to statistical learning theory (Vapnik, 1999). Formally, the approach can be regarded as a multi-objective optimization problem: during the learning procedure, it optimizes two objectives, i.e., the task performance and the extracted linguistic information. In addition, we develop a computationally efficient algorithm to solve the optimization problem. By inspecting the trend of the Pareto-optimal points, the relationship between task performance and linguistic information can be clearly illustrated. Returning to our questions, we consider two instances within the proposed methodology: one aims to maximize the amount of linguistic information (i.e., adding), while the other tries to minimize it (i.e., reducing).
We conduct experiments on two popular NLP tasks, i.e., machine translation and language modeling, and choose three different linguistic properties, including two syntactic properties (part-of-speech and dependency labels) and one phonetic property. We investigate the relationship between NMT performance and each kind of syntactic information, and the relationship between LM performance and phonetic information. For machine translation, we use an LSTM model, i.e., RNN-search (Bahdanau et al., 2015), and the Transformer (Vaswani et al., 2017) as the main model architectures, and conduct our experiments on En ⇒ De and Zh ⇒ En tasks. For language modeling, we employ the LSTM model and conduct experiments on the Penn Treebank dataset. The experimental results show that: i) syntactic information encoded by NMT models is important for the MT task, and reducing it leads to sharply decreased performance; ii) the standard NMT model obtained by maximum likelihood estimation (MLE) is Pareto-optimal for the Transformer, but this is not the case for LSTM-based NMT; iii) reducing the phonetic information encoded by LM models only makes task performance drop slightly.
In summary, our contributions are three-fold: 1. We make an initial attempt to study the relationship between encoded linguistic information and task performance, i.e., how the change of linguistic information affects the performance of models. 2. We propose a new viewpoint based on Pareto Optimality, as well as a principled approach formulated as a multi-objective optimization problem, to visualize this relationship. 3. Our experimental results show that encoding more linguistic information does not necessarily yield better task performance; the specific model architecture also matters.

Related Work
Probe With the impressive performance of neural network models for NLP tasks (Sutskever et al., 2014; Luong et al., 2015; Vaswani et al., 2017; Devlin et al., 2019; Xu et al., 2020), people have become interested in understanding neural models (Ding et al., 2017; Li et al., 2019, 2020). One popular interpretation method is the probe (Conneau et al., 2018), also known as auxiliary prediction (Adi et al., 2017) and diagnostic classification (Hupkes et al., 2018), which aims to understand how neural models work and what information they have encoded and used. From the perspective of information theory, Voita and Titov (2020) and Pimentel et al. (2020b) show that probes can be used to estimate the amount of linguistic information captured by a model. However, recent studies point out that probes fail to demonstrate whether the information is actually used by models. For example, Hewitt and Liang (2019) show that a probe can also achieve high accuracy in predicting randomly generated tags, which are useless for the task. And Ravichander et al. (2021) show that representations encode linguistic properties even when those properties are invariant and not required for the task. Instead of studying the encoded linguistic information by training a probe on fixed representations, in this work we study how the change in the amount of linguistic information affects the performance of NLP tasks.
Information Removal Information removal is crucial in the areas of transfer learning (Ganin and Lempitsky, 2015; Tzeng et al., 2017; Long et al., 2018) and fairness learning (Xie et al., 2017; Elazar and Goldberg, 2018), where the goal is to remove domain information or bias from learned representations. One popular method is adversarial learning (Goodfellow et al., 2014; Ganin and Lempitsky, 2015), which trains a classifier to predict properties of the representations, e.g., domain information or gender bias, while the feature extractor tries to fool the classifier. In this work, when using our method to reduce the linguistic information in the representations, we find that our multi-objective loss function takes the same form as adversarial learning, which provides a theoretical justification for using adversarial learning to find the Pareto-optimal solutions to a multi-objective problem.
Recently, Elazar et al. (2020) also propose to study the role of linguistic properties with the idea of information removal (Ravfogel et al., 2020). However, the representations obtained by their method may not be Pareto-optimal, because it only minimizes the mutual information and ignores the objective of task performance. In contrast, our proposed method optimizes towards both objectives, so our results can be used to visualize the relationship between linguistic properties and task performance.
Pareto Optimality The idea of Pareto Optimality (Mas-Colell et al., 1995) is an important criterion in economics, where the goal is to characterize situations in which no variable can be made better off without making at least one other variable worse off. It has also been widely used in sociology and game theory (Beckman et al., 2002; Chinchuluun et al., 2008). In artificial intelligence, Martínez et al. (2020) use Pareto optimality to address the group fairness problem, and Duh et al. (2012) propose to optimize an MT system on multiple metrics based on the theory of Pareto optimality. In particular, Pimentel et al. (2020a) propose a variant of probing on the hidden representations of deep models and consider Pareto optimality in terms of both objectives, similar to our work. Compared with their work, one difference is the choice of objectives. Another significant difference is that they optimize the probing model in a conventional fashion, and are thus unable to study the relationship between linguistic information and task performance.

Visualizing Relationship via Pareto Optimality
We consider the relationship between linguistic information and task performance for two popular tasks in NLP, i.e., machine translation and language modeling. Let x = {x_1, x_2, ..., x_N} be a sentence and s = {s_1, s_2, ..., s_N} be the labels of the linguistic property of x, where s_i is the label for x_i, e.g., its POS tag. On both tasks, a deep model typically encodes x into a hidden representation h with a sub-network E parameterized by θ_e, i.e., h = E(x), and then uses another sub-network D parameterized by θ_d to map h into an output.

Background
h and Loss in NMT An NMT architecture aims to output a target sentence y = {y_1, y_2, ..., y_M} for a given source sentence x according to P(y | x; θ) (Zaremba et al., 2014; Vaswani et al., 2017), where θ denotes the parameters of a sequence-to-sequence neural network containing an encoder E and a decoder D. We define h as the output of the encoder. To train θ, the MLE loss is usually minimized on a training dataset. For NMT, the loss is defined as follows:

L_θ(x, y) = −log P(y | x; θ) = −∑_{j=1}^{M} log P(y_j | y_{<j}, x; θ).   (1)

In our experiments, we consider two models, namely the LSTM (Bahdanau et al., 2015) and the Transformer (Vaswani et al., 2017).
h and Loss in LM For the language modeling task, a deep model typically generates a token x_j based on x_{<j} according to P(x_j | x_{<j}; θ). Here the sub-network E is a hidden layer that encodes x_{<j} into h_{<j}, and D is the sub-network that generates x_j on top of h_{<j}. The parameter θ is optimized by the following MLE loss:

L_θ(x) = −∑_{j=1}^{N} log P(x_j | x_{<j}; θ).

To make notation consistent for both NMT and LM, in the rest of this paper we follow the form of Eq. (1) and re-write L_θ(x) in LM as L_θ(x, y), where y is a shifted version of x, i.e., y_j = x_{j+1}.

Encoded Information Let I(h, s) denote the linguistic information in the representation h, i.e., the mutual information between h and the linguistic label s. Since the probability p(h, s) is unknown, it is intractable to compute I(h, s) exactly. Following Pimentel et al. (2020b), we approximately estimate I(h, s) using a probing model q as follows:

I(h, s) = H(s) − H(s | h) ≈ H(s) − L_{θ_q}(h, s),   (2)

where H(s) is the entropy of the linguistic labels, H(s | h) is the ideal conditional cross-entropy, and L_{θ_q}(h, s) is the cross-entropy loss of the probe model q parameterized by θ_q.
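As a small illustration of the estimate in Eq. (2), once the probe's cross-entropy loss is known, the computation reduces to subtracting it from the label entropy. The sketch below is purely illustrative (the helper names and the plug-in entropy estimate are our own, not part of the paper's implementation):

```python
import math

def entropy(labels):
    """Empirical (plug-in) estimate of H(s) in nats from a list of labels."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def estimated_mutual_information(labels, probe_cross_entropy):
    """I(h, s) ~= H(s) - L_q(h, s): since the probe's cross-entropy loss
    upper-bounds H(s|h), this difference is a lower bound on the true MI."""
    return entropy(labels) - probe_cross_entropy
```

A probe whose loss is close to H(s) implies the representation carries almost no information about s; a loss near zero implies it carries nearly all of it.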
Theory of Pareto Optimality Pareto optimality (Mas-Colell et al., 1995) arises in multi-objective optimization. Suppose we have K different objectives M_k(θ) for evaluating a parameter θ, i.e., the problem is

min_θ {M_1(θ), M_2(θ), ..., M_K(θ)}.   (3)

There are two important concepts in Pareto optimality:

Definition 1 (Pareto Optimal). A parameter θ* is Pareto-optimal iff there exists no θ such that M_k(θ) ≤ M_k(θ*) for all k and M_k(θ) < M_k(θ*) for at least one k.

Definition 2 (Pareto Frontier). The set of all Pareto-optimal parameters is called the Pareto frontier.
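Definitions 1 and 2 translate directly into a dominance check and a filtering step. The following generic sketch (ours, not the paper's code) assumes each model is summarized by a tuple of objective values oriented so that higher is better, matching the axes of Figure 1:

```python
def dominates(a, b):
    """a dominates b iff a is at least as good on every objective (higher is
    better here) and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(points):
    """Definition 2: keep exactly the points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]
```

The same filter applies under minimization after flipping the comparison signs.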

Viewpoint via Pareto Optimality
Motivation Suppose θ is a given model parameter, L(θ) is its task performance on a test set, and I(θ) is the amount of linguistic information encoded in its hidden representation. Conventionally, if one can find a function f such that I = f(L) for any θ, it is trivial to study their relationship by visualizing f. Unfortunately, in complicated situations such as the one illustrated in Figure 1, no such function exists, owing to a large number of many-to-many correspondences between the two variables.
Our Viewpoint Pareto Optimality, a well-known criterion in economics (Mas-Colell et al., 1995), is widely used to analyze the relationship among multiple variables in complicated environments (Chinchuluun et al., 2008). In our context, it is also a powerful tool to reveal the relationship between encoded linguistic information and task performance. Taking the Pareto frontier in Figure 1 as an example: since the capacity of a model is fixed and linguistic information may compete with other kinds of information, capturing more linguistic information may reduce the amount of information from other sources that are also helpful for the model. Conversely, if increasing the amount of linguistic information constantly led to performance gains, i.e., linguistic information were complementary to translation, only one Pareto-optimal point would exist, in the top-right corner. Therefore, in this paper, we propose to study the relationship between I(θ) and L(θ) from the viewpoint of Pareto Optimality. Our key idea is to take into account only Pareto-optimal models, rather than all models as in the conventional method. Thanks to the definition of Pareto optimality, there are no many-to-many correspondences between the two variables along the Pareto frontier; hence their relationship can be visualized by the trend of the frontier points, as shown in Figure 1. Taking Figure 1 as an example, to answer the questions raised before, we can see that adding more information can increase task performance compared with a standard model. Under this viewpoint, the core challenge is how to obtain a set of models that are Pareto-optimal on a test dataset.
It is natural to employ a heuristic method to approximately obtain the Pareto-optimal models, as follows. We first randomly select a number of checkpoints during standard training and probe each checkpoint by optimizing its corresponding probing model q, as in Eq. (2). Second, we record the task performance and the amount of linguistic information of the selected models on a test set. Finally, we find the Pareto-optimal points and obtain the Pareto frontier. However, when using this method in our experiments, we find that the amounts of encoded linguistic information are similar for all checkpoints and that the task performance of those checkpoints is worse than that of the optimal model. Hence, in the next section, a new method is presented to approximately derive the Pareto-optimal models.

Multi-Objective Optimization
To study the relationship between linguistic information and task performance, our goal is to obtain a set of models θ that are Pareto-optimal on test data in terms of both objectives. Inspired by statistical learning theory (Vapnik, 1999), we propose an approach that optimizes the Pareto-optimal models towards both objectives on a given training dataset; these models are expected to generalize well, i.e., to be Pareto-optimal on unseen test data. Formally, our approach can be formulated as the following multi-objective optimization problem:

min_θ L_θ(x, y),  max_θ I(h, s),   (4)

where minimizing L_θ(x, y) promotes task performance and maximizing I(h, s) encourages the model to encode more linguistic information in the representation. Once we obtain a set of Pareto-optimal models, we can observe how increasing the encoded linguistic information affects task performance.
To further study how reducing the encoded linguistic information affects task performance, we optimize a similar multi-objective problem:

min_θ L_θ(x, y),  min_θ I(h, s).   (5)

The only difference between Eq. (4) and Eq. (5) is that the former maximizes I(h, s) while the latter minimizes it. Since H(s) is a constant term, we can plug Eq. (2) into the two equations above and obtain the following reduced multi-objective problems:

min_θ L_θ(x, y),  min_θ [ min_{θ_q} L_{θ_q}(h, s) ],   (6)

min_θ L_θ(x, y),  max_θ [ min_{θ_q} L_{θ_q}(h, s) ].   (7)

Notice that in the above equations, min_{θ_q} L_{θ_q}(h, s) resembles conventional probing if h is a fixed representation. However, unlike standard probing applied on top of a fixed h determined by the standard model, here h is the representation obtained from an encoder E parameterized by θ_e. It is also worth noting that the Pareto frontiers obtained from Eq. (6) and Eq. (7) are independent, although they share a similar measurement, because Pareto optimality is only defined with respect to a fixed set of objectives.

Optimization Algorithm
To solve the above multi-objective problems, we leverage the linear-combination method to find a set of solutions, and then filter out the non-Pareto-optimal points to obtain the Pareto frontier. The details of our algorithm are given below.
Optimization Process Since the detailed optimization method for Eq. (6) is similar to that for Eq. (7), in the following we take Eq. (6) as an example to describe the optimization method. Inspired by Duh et al. (2012), we employ a two-step strategy to find the Pareto frontier and address the multi-objective problems.

[Figure 2: Overview of our multi-objective optimization method, where L_y = L_θ(x, y) and L_{θ_q} = L_{θ_q}(h, s). The figure shows the inputs passing through the encoder θ_e, whose representation feeds the decoder and, via the GM layer, the probe θ_q.]
In the first step, we find Pareto-optimal solutions to the problem. There are several methods to solve it, such as linear combination, PMO (Duh et al., 2012), and APStar (Martínez et al., 2020). In this work, we adopt the linear-combination method because of its simplicity. Specifically, we select a coefficient set {λ_k | λ_k > 0} and minimize the following interpolated objective for each coefficient:

min_{θ_e, θ_d} L_θ(x, y) + λ_k [ min_{θ_q} L_{θ_q}(h, s) ].   (8)

Notice that the first term of the loss function, L_θ(x, y), is a function of both the encoder parameters θ_e and the decoder parameters θ_d, while the second term, min_{θ_q} L_{θ_q}(h, s), is a function of θ_e only. Therefore, when minimizing Eq. (8), we apply a Gradient-Multiple (GM) layer on the representations before feeding them into the probe model. As shown in Figure 2, in the forward pass the GM layer acts as an identity transform, while in the backward pass it multiplies the gradient by ±λ and passes it to the preceding layers. Note that when the multiplier is −λ, the GM layer is the same as the Gradient Reversal Layer (Ganin and Lempitsky, 2015).
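The GM layer's contract can be sketched in a few lines. This toy stand-alone version (not the actual Fairseq/PyTorch autograd implementation used in the paper) makes the forward/backward asymmetry explicit:

```python
class GradientMultipleLayer:
    """Identity in the forward pass; scales the incoming gradient by lam in
    the backward pass. With lam < 0 it behaves as a gradient reversal layer."""

    def __init__(self, lam):
        self.lam = lam

    def forward(self, h):
        # The representation reaches the probe unchanged.
        return h

    def backward(self, grad_from_probe):
        # The probe's gradient reaches the encoder scaled by lam.
        return self.lam * grad_from_probe
```

In a real framework this would be implemented as a custom autograd function so the scaling applies transparently during backpropagation.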
Suppose {θ*_k | λ_k ∈ Λ} is the solution set obtained by minimizing Eq. (8). In the second step, to get more accurate solutions, we filter out the non-Pareto-optimal points of this solution set. Finally, we obtain the Pareto frontier of the multi-objective problem according to the definition of Pareto optimality.
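To see why sweeping the coefficient traces out a frontier, consider a toy scalar problem (ours, for illustration only) with two competing objectives f1(t) = (t − 1)² and f2(t) = (t + 1)²: minimizing f1 + λ·f2 has a closed-form solution, and varying λ > 0 slides it between the two objectives' individual minimizers, producing mutually non-dominated solutions:

```python
def scalarized_minimizer(lam):
    """argmin_t (t - 1)**2 + lam * (t + 1)**2: setting the derivative
    2(t - 1) + 2*lam*(t + 1) = 0 gives t = (1 - lam) / (1 + lam)."""
    return (1 - lam) / (1 + lam)

def objectives(t):
    """The two competing objective values at t."""
    return ((t - 1) ** 2, (t + 1) ** 2)

# A coefficient sweep, analogous to the set {lambda_k} in Eq. (8).
solutions = [objectives(scalarized_minimizer(lam)) for lam in (0.1, 1.0, 10.0)]
```

Each solution in the sweep is better on one objective and worse on the other, which is exactly the trade-off structure the second filtering step preserves.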

Algorithm 1 Optimization Algorithm
Input: Λ = {λ_k}, learning rate η
Output: Pareto frontier set P = {(θ^i_e, θ^i_d, θ^i_q)}
1: M = {}  // empty model set
2: for λ_k ∈ Λ do  // minimize Eq. (8)
3:   Randomly initialize θ^k_e, θ^k_d, and θ^k_q
4:   while not converged do
5:     Update θ^k_d by descending ∂L_θ(x, y)/∂θ_d
6:     Update θ^k_e by descending ∂L_θ(x, y)/∂θ_e + λ_k ∂L_{θ_q}(h, s)/∂θ_e  // +λ_k is for Eq. (6); changing it to −λ_k would optimize Eq. (7)
7:     Update θ^k_q by one step descending ∂L_{θ_q}(h, s)/∂θ_q
8:   end while
9:   Fix θ^k_e and θ^k_d, and retrain a new probe θ^k_q to convergence
10:  M = M ∪ {(θ^k_e, θ^k_d, θ^k_q)}
11: end for
12: P = Pareto-optimal points of M
13: return P

Detailed Algorithm The overall optimization algorithm for Eq. (6) is shown in Algorithm 1. Theoretically, when minimizing Eq. (8), at every step updating θ we should retrain the probe model θ_q to minimize L_{θ_q}(h, s) for many steps, in order to estimate H(s | h) precisely. However, this is time-consuming and inefficient. Instead, after updating θ, we update θ_q by only one step (line 7 of Algorithm 1). Empirically, we find that optimization in this way is very effective.
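The alternating update (a joint step on the combined loss, then a single probe step, as in line 7 of Algorithm 1) can be exercised on a scalar toy instance. The setup below is entirely hypothetical: a one-parameter "encoder" h = θ_e·x with a squared-error task loss and a linear probe, using analytic gradients in place of backpropagation:

```python
def alternating_optimization(lam, data, lr=0.05, steps=300):
    """Toy instance of the alternating inner loop. After each joint step on
    task_loss + lam * probe_loss, the probe takes a single gradient step on
    its own loss, mirroring the one-step probe update."""
    theta_e, theta_q = 0.1, 0.1
    for _ in range(steps):
        for x, y, s in data:
            h = theta_e * x
            # gradient of (h - y)^2 + lam * (theta_q*h - s)^2 w.r.t. theta_e
            g_e = 2 * (h - y) * x + lam * 2 * (theta_q * h - s) * theta_q * x
            theta_e -= lr * g_e
            # one probe step on its own loss
            h = theta_e * x
            g_q = 2 * (theta_q * h - s) * h
            theta_q -= lr * g_q
    return theta_e, theta_q
```

With lam > 0 the joint step also pulls the encoder towards making s predictable (the +λ case); flipping the sign of lam gives the gradient-reversal (−λ) case.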
In addition, as mentioned by Elazar and Goldberg (2018), information leakage may occur when minimizing the mutual information. Therefore, after the training process is finished, we fix the deep model and retrain another probe model to estimate H(s | h) more precisely (line 9 in Algorithm 1). When maximizing the mutual information, we find no difference between H(s | h) estimated by the jointly trained and the retrained probe models.

Dataset
We conduct experiments on both machine translation and language modeling tasks. For machine translation, we conduct the experiments on En ⇒ De and Zh ⇒ En translation tasks. For the En ⇒ De task, we use the WMT14 corpus, which contains 4M sentence pairs. For the Zh ⇒ En task, we use the LDC corpus, which consists of 1.25M sentence pairs; we choose NIST02 as our validation set and NIST06 as our test set. For the language modeling task, we use the Penn Treebank dataset. We preprocess our data using byte-pair encoding (Sennrich et al., 2016) and keep all tokens in the vocabulary. For machine translation, we use the case-insensitive 4-gram BLEU score (Papineni et al., 2002) to measure task performance, which is known to correlate positively with the MLE loss (?). For language modeling, we directly use the MLE loss to evaluate task performance.

Linguistic Properties
For machine translation, we study part-of-speech (POS) and dependency (DEP) labels in this work. Since there are no gold labels for the MT datasets, we use the Stanza toolkit (Qi et al., 2020) to annotate source sentences and use the pseudo labels for running our algorithm, following Sennrich and Haddow (2016) and Li et al. (2018). We clean the labels and remove from the dataset the sentences that Stanza fails to parse. To study whether all kinds of linguistic information are critical for neural models, we also investigate phonetic information on the language modeling task. More precisely, the probing model needs to predict the first character of the International Phonetic Alphabet transcription of each word. We obtain the labels with the open-source toolkit English-to-IPA. We use the mutual information I(h, s) = H(s) − H(s | h) to evaluate the amount of information in the representations. Since H(s) is a constant, we only compare H(s | h) in the experiments. Note that H(s | h) is estimated by our probe model q.

Implementation Details
All of our models are implemented with Fairseq (Ott et al., 2019). For the NMT experiments, our LSTM model consists of a bi-directional 2-layer encoder with 256 hidden units and a 2-layer decoder with 512 hidden units, and the probe model is a 2-layer MLP with 512 hidden units. Our Transformer model consists of a 6-layer encoder and a 6-layer decoder, whose hyper-parameters are the same as those of the base model in Vaswani et al. (2017).

Experiment Results
In the following experiments, "Model + Property", e.g., "Transformer+Pos", corresponds to Eq. (4) and studies how adding the linguistic property's information affects task performance. Conversely, "Model − Property", e.g., "Transformer−Pos", corresponds to Eq. (5) and studies how removing it affects task performance. It is worth noting that merging the two frontiers of +Property and −Property would lead to trivial results, because the Pareto-optimal points of +Property are more likely to dominate. However, we think the frontier of −Property is helpful for answering whether reducing the encoded linguistic information affects model performance. Therefore, we plot the Pareto frontiers for the two objectives independently.

Soundness of Methodology
The heuristic method mentioned before can be considered a simple and straightforward baseline for measuring the relationship. To set up this baseline, we first save checkpoints every 1,000 steps when training a standard model. Second, we randomly sample 30 checkpoints for probing and plot a scatter diagram in terms of BLEU and encoded linguistic information. As shown in Figure 4, we compare our proposed method with the heuristic method in the "Transformer+Pos" setting. Compared with the baseline, the frontier obtained from our method is better: for each model explored by the baseline, there exists at least one model explored by our method that is larger on both objectives, i.e., encoded linguistic information and BLEU score. The main reason is that the baseline's training objective only considers task performance, and most checkpoints contain similar amounts of encoded linguistic information. Therefore, the models optimized by our multi-objective method are closer to the globally Pareto-optimal points on the training data (although the globally Pareto-optimal solutions are unknown, our solutions are definitely closer to them than those of the baseline), making the revealed relationship between encoded linguistic information and task performance more reliable. In the next subsection, our proposed method is therefore used to visualize this relationship for neural models.

Visualization Results
Results on NMT The results of machine translation on the WMT dataset are shown in Figure 3. For LSTM-based NMT, we observe that the standard model, i.e., the triangle in Figure 3, is not on the Pareto frontier in Figure 3 (a,c). In other words, when adding linguistic information into the LSTM model, it is possible to obtain a model that contains more POS or DEP information and meanwhile achieves a better BLEU score than the standard model obtained by standard training. In contrast, for Transformer-based NMT, the standard model is on the Pareto frontier, as shown in Figure 3 (e,g). This finding provides an explanation for a fact in NMT research: many efforts (Luong et al., 2016; Nădejde et al., 2017; Bastings et al., 2017; Hashimoto and Tsuruoka, 2017; Eriguchi et al., 2017) have been devoted to improving the LSTM-based NMT architecture by explicitly modeling linguistic properties, but few have targeted Transformer-based NMT (McDonald and Chiang, 2021; Currey and Heafield, 2019). In addition, when removing the linguistic information from the LSTM or the Transformer, the standard model is very close to the lower-right end of the Pareto frontier, or even on the frontier, as shown in Figure 3 (b,d,f,h). This result shows that removing linguistic information always hurts the performance of NMT models for both the LSTM and the Transformer, indicating that encoding POS and DEP information is important for the NMT task. Similar trends are observed on the LDC datasets, as shown in Figure 5. More details about the effect of randomness on our approach are given in Appendix B.
Results on LM The above experiments have shown that both kinds of syntactic information are important for NMT models; a natural follow-up question is whether all kinds of linguistic information are important for neural models. To answer this question, we investigate the influence of phonetic information on a language model. Figure 6 depicts the relationship between encoded phonetic information and task performance for an LSTM-based language model. In Figure 6 (a), we find that the performance of Pareto-optimal models drops slightly when forcing an LSTM model to encode more phonetic information. Besides, as the Pareto frontier in Figure 6 (b) shows, removing phonetic information from an LSTM model only leads to a slight change in performance. These experiments demonstrate that the encoded phonetic information may not be that critical for an LSTM-based language model. This finding suggests that not all kinds of linguistic information are crucial for LSTM-based LM, and it is not promising to further improve language modeling with phonetic information.

Conclusion
This paper studies the relationship between linguistic information and task performance and proposes a new viewpoint inspired by the criterion of Pareto Optimality. We formulate this goal as a multi-objective problem and present an effective method to address it by leveraging the theory of Pareto optimality. We conduct experiments on both MT and LM tasks and study their performance with respect to several linguistic information sources. Experimental results show that the presented approach is more plausible than a baseline method, in the sense that it explores better models in terms of both encoded linguistic information and task performance. In addition, we obtain the following findings: i) syntactic information encoded by NMT models is important for the MT task, and reducing it leads to sharply decreased performance; ii) the standard NMT model obtained by minimizing the MLE loss is Pareto-optimal for the Transformer, but this is not the case for LSTM-based NMT; iii) reducing the phonetic information encoded by LM models only leads to a slight performance drop.

Figure 1 :
Figure 1: Illustration of the Pareto frontier with a toy example. The triangle corresponds to the standard checkpoint with the best performance, and each circle corresponds to a sampled checkpoint. The y-axis indicates the linguistic information I encoded by the model, and the x-axis indicates the negative loss value −L.

Figure 3 :
Figure 3: Experiments on the WMT14 corpus. The triangle denotes the model trained by minimizing the MLE loss, each circle denotes a model obtained by our method, and the models on the line form the Pareto frontier.

Figure 4 :
Figure 4: Comparison with the baseline method. The triangle denotes the standard model obtained by minimizing the MLE loss. The green line and blue line are the frontiers obtained from the baseline method and our method, respectively.

Figure 5 :
Figure 5: Experimental results on the LDC corpus. The format is the same as in Figure 3.

Figure 6 :
Figure 6: Experimental results on the PTB dataset.