RevUp: Revise and Update Information Bottleneck for Event Representation

The use of external ("side") semantic knowledge has been shown to yield more expressive computational event models. To enable the use of side information that may be noisy or missing, we propose a semi-supervised, information bottleneck-based discrete latent variable model. We reparameterize the model's discrete variables with auxiliary continuous latent variables and a light-weight hierarchical structure. Our model is trained to minimize the mutual information between the observed data and the optional side knowledge that is not already captured by the new auxiliary variables. We theoretically show that our approach generalizes past approaches, and we perform an empirical case study on event modeling. We corroborate our theoretical results with strong empirical experiments, showing that the proposed method outperforms previously proposed approaches on multiple datasets.


Introduction
In this work, we are interested in addressing limitations in how computational event modeling can make use of relevant, supplementary semantic knowledge. When modeling text descriptions of complex situations, such as newspaper descriptions of real-world events, learning how to encode richer information about those descriptions can be a fruitful way of improving modeling performance (Judea and Strube, 2015; Xia et al., 2021). E.g., if we are dealing with sequences of events, like a newspaper report of a stock or commerce transaction, then being able to encode that "buying" and "selling," even though they may have nuanced connotation differences, are instances of the same general event (a TRANSACTION event) can improve downstream predictive performance on what events might happen next (Ferraro and Van Durme, 2016; Rezaee and Ferraro, 2021).
Figure 1: Revise and Update steps for an observed node. We update the proposed distribution r_γ(t|z) with the empirical distribution p_D(t|x) to produce the revised distribution p_γ(t|x, z) (dashed). We then minimize the KL divergence between the proposed and revised distributions to update the proposed distribution. For unsupervised nodes, we rely on r_γ(t|z) without updating.
However, there is not an obvious single way to learn to encode this richer information, as three different questions naturally come to mind: (1) If the model is representationally limited, can we address these representational limitations of the model itself, such as by developing richer latent representations z of the input x? (2) If there is available background or side information t that may be especially relevant for the modeling task at hand, can we develop systems that make use of it? (3) Even when side information is available, it may be noisy: e.g., it may not always be present (missing data) or it may contain errors. How can we make our models robust to this noisy side information? In the context of a text-based sequence modeling problem, we propose an approach that addresses all three of these questions.
We provide a conceptual overview of our method, RevUp, in Fig. 1: we revise the (red) learned proposal distribution with (blue) empirical information about when particular aspects of side knowledge appeared in training, and we update by minimizing the KL-divergence between the new distribution and the previously proposed one.

Figure 2: Components of RevUp for event modeling. We encode the sequence of events x into Gaussian latents z and discrete knowledge t. Red nodes such as t_1 are latent: [VERDICT] for the "convicted man of murdering" event. RevUp predicts these latent nodes by sampling from the Gumbel-Softmax distribution. Blue nodes such as t_M are observed: [SENTENCING] for the "sentenced man to death" event. We use this observed node to modify the proposed distribution and then draw a sample from the revised distribution.
Our approach is inspired by recent work (Kong et al., 2019), which argued that a key to the success of neural advances in NLP (Mikolov et al., 2013; Devlin et al., 2018; Yang et al., 2019, i.a.) is that they fall within the information-theoretic InfoMax framework (Linsker, 1988). In this viewpoint, neural advances can be attributed to implicitly maximizing the mutual information between different representations of the same document.
However, we additionally want to use "side" or "external" knowledge to aid our modeling. The use of side information has long been desired for effective representation (Wyner, 1975; Wyner and Ziv, 1976). While this side information can occur in a variety of forms (Chen et al., 2020; Padia et al., 2018), we broadly view it as a concise, abstract view of information conveyed in the main input.
Our work centers on the idea that the latent neural representations z and the side knowledge t act as complementary representations of the same input x: z provides a compact representation of the input itself, while t provides more generic information about the data or task. As such, we use an InfoMax-inspired formulation to incorporate external knowledge into the modeling and latent representation aspects. Since recent work (Joy et al., 2022; Chen et al., 2019) has suggested that providing some sort of "guidance" to these latent variables is beneficial and can lead to guided models outperforming both fully supervised and fully unsupervised methods, learning with less-than-full observation of structured side knowledge is a requirement. We illustrate this in Fig. 2, where the correct value for the first piece of side information (t_1 = Verdict) is missing but the last piece of side information is known (t_M = Sentencing). Our model must be able to operate over this partially observed sequence of side knowledge as a way of modeling the original event description.
We propose a unified approach, RevUp (Revise and Update Information Bottleneck), that maximizes the mutual information I(z; t) between the external knowledge and the latent representation. We consider the external knowledge t to be a discrete random variable in a lightly structured, semi-supervised setting. We demonstrate the effectiveness of this approach on multiple modeling tasks. Our contributions are: (1) We provide a principled, information-theoretic approach for injecting side information into neural discrete latent variable models. We provide theoretical backing to show that our methodology captures available information from external knowledge.
(2) We define a new model that leads to state-of-the-art results in the semi-supervised setting on two standard event modeling datasets.
(3) We show that our proposed model generalizes the existing state-of-the-art model studied in Rezaee and Ferraro (2021), where the "parameter injection" method developed there can be understood as a special case of our framework.
(4) We experimentally show that our model is more robust when the external knowledge is noisy and it outperforms other baselines when the external knowledge is partially observed.

Background
In this section we present the related work and discuss the connections to our approach.
Mutual Information (MI) The mutual information I(x; y) measures how much two random variables x and y depend on each other. It is defined as I(x; y) = E_{p(x,y)}[log (p(x,y) / (p(x)p(y)))], where I(x; y) = 0 when x and y are independent. Conditional MI I(x; y|z) extends MI to measure the conditional dependence between x and y given z: I(x; y|z) = E_{p(x,y,z)}[log (p(x,y|z) / (p(x|z)p(y|z)))]. When I(x; y|z) = 0, there is no information shared between x and y that is not already present in z.
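For concreteness, the MI definition above can be computed directly for small discrete distributions. The following is an illustrative sketch (ours, not from the paper), where `mutual_information` is a hypothetical helper name:

```python
# Compute I(x; y) for a discrete joint distribution given as a table
# p_xy[i, j] = p(x=i, y=j), following the definition
#   I(x; y) = sum_{x,y} p(x,y) log[ p(x,y) / (p(x) p(y)) ]   (in nats).
import numpy as np

def mutual_information(p_xy):
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x), shape (n, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, m)
    mask = p_xy > 0                          # convention: 0 log 0 = 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

# Independent variables: the joint factorizes, so I(x; y) = 0.
p_indep = np.outer([0.5, 0.5], [0.25, 0.75])
# Perfectly dependent variables: I(x; y) = H(x) = log 2.
p_dep = np.array([[0.5, 0.0], [0.0, 0.5]])
```

The two test distributions illustrate the endpoints of the dependence spectrum noted in the definition.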

Information Bottleneck Principle (IB)
The Information Bottleneck principle is a method for finding the most informative encoding of an input x with respect to a target output y. This is accomplished by finding the maximally compressed encoding z of x that is most informative about y (Tishby et al., 1999). The objective is to maximize I(z; y) − βI(x; z), where I(z; y) denotes the mutual information between z and y, and I(x; z) denotes the mutual information between x and z. β balances compression and informativeness.
Variational Autoencoder (VAE) The Variational Autoencoder is similar to the Information Bottleneck principle in that it finds an encoding z of x from which to reconstruct x. However, the approach is different: the aim is to approximate the intractable posterior distribution p_θ(z|x) with a variational distribution q_φ(z|x) in order to optimize the Evidence Lower Bound (ELBO) (Kingma and Welling, 2013).
Incorporation of Side Knowledge Conceptually, we consider an "event" to be a condensed form of knowledge that outlines part of a particular situation. Semantic frames are a prime example of relevant side knowledge: a semantic frame can be thought of as an abstraction over highly related events. Though frames also provide abstractions over the whos, whats, wheres and hows of events, in this setting it is sufficient to consider an event's semantic frame as a type of label or cluster id. Multiple potential sources of semantic frames exist, e.g., FrameNet, PropBank, or VerbNet.
Recently, incorporating external knowledge has been investigated by numerous studies in a wide range of tasks beyond NLP (Kang et al., 2017; Flajolet and Jaillet, 2017; Zhang et al., 2021), including zero-shot classification (Badirli et al., 2021). Common across these efforts is treating the side information as part of the input to be encoded. This makes the side information prerequisite knowledge for the model to be learned, rather than supplementary.
A recent approach, the Sequential, Semi-supervised Discrete Variational Autoencoder (Rezaee and Ferraro, 2021, SSDVAE), is a method for structured semi-supervised modeling that allows, but does not require, side information to guide the learning through an approach called "parameter injection." Because the SSDVAE framework is a deep latent variable model that is specifically designed to treat external knowledge as supplementary to the main task, we focus our study within it and the associated NLP-based computational event modeling tasks it examined. They define parameter injection as follows.

Definition (Parameter injection). Let t ∼ GumbelSoftmax(t; γ), where γ are the logits. If t is observed as external knowledge and represented with a one-hot vector 1(t) with t_{k*} = 1, the operation γ ← γ + (γ ⊙ 1(t)) guides the latent variable t during training because it on average increases the value of t_{k*} (Maddison et al., 2016).
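The definition above can be sketched in a few lines. This is our own illustrative reading of SSDVAE-style parameter injection, not its released code, and the function name is a hypothetical:

```python
# "Parameter injection" sketch: when the frame t is observed, the logit
# at the observed index is boosted (here, doubled) before Gumbel-Softmax
# sampling; when t is latent, the logits are left untouched.
import numpy as np

def inject(gamma, t_observed=None):
    """gamma: logit vector; t_observed: index of the observed frame, or None."""
    if t_observed is None:              # latent node: no injection
        return gamma
    one_hot = np.zeros_like(gamma)
    one_hot[t_observed] = 1.0
    return gamma + gamma * one_hot      # gamma <- gamma + (gamma ⊙ 1(t))

gamma = np.array([0.1, 2.0, -0.5])
injected = inject(gamma, t_observed=1)  # only the observed logit changes
latent = inject(gamma)                  # unchanged for latent nodes
```

Note the injection is elementwise: only the observed index's logit is modified, nudging Gumbel-Softmax samples toward the observed frame.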
In RevUp, we build on and generalize this notion of parameter injection, in part by introducing the empirical distribution p_D(t|x) to accommodate the dependence between the data x and the knowledge t.

RevUp: Revise and Update
Previous work (Rezaee and Ferraro, 2021, SSDVAE) has empirically shown that incorporating external knowledge into discrete Gumbel-Softmax parameters improves model performance on different metrics such as classification, event modeling, and training speed. We seek to re-frame this view in information-theoretic terms and generalize it. For fairness and consistency we stick with the same types of computational event modeling tasks that SSDVAE was developed for. We first describe the probabilistic model we develop (section 3.2) and the loss/training methodology (section 3.3). Together these enable smoothly incorporating side knowledge into a probabilistic neural model. Recall that we provide a conceptual overview of RevUp in Fig. 1.

Setup
We assume that within a document we have a sequence of M different predicate/argument-style events, x = x_1 ... x_M, with each x_m describing some action (the predicate) occurring among participants of that action (the arguments). If each event x_m could also be paired with a semantic frame t_m, so that we have a corresponding sequence t, then, considered across multiple events, t could provide a sequential generalization. E.g., in the example event sequence from Fig. 2 that includes "convicted man of murdering" and "sentenced man to death," if each event can be associated with a semantic frame (such as VERDICT and SENTENCING for these two events), then the corresponding sequence of frames provides both an abstraction over the entire event sequence and an incredibly rich, yet low-dimensional, collection of side knowledge.
Having paired sequences x and t is not restrictive. Whether t is partially observed (i.e., not all x_m have observed frames t_m) or noisy (i.e., the observed t_m for x_m is incorrect) during training, our approach can still extract useful signal.

Probabilistic Encoding
Our problem setup is that we have input text x describing some complex situation, paired with a partially observed sequence of frames t. To account for when side knowledge is or is not observed, we define a knowledge indicator set l = {l_m}_{m=1}^M with l_m ∈ {0, 1}, where l_m = 1 denotes that the external knowledge t_m is present and l_m = 0 means the external knowledge is not available (latent). To enable successful modeling, we introduce z = {z_m}_{m=1}^M, a set of M latent variables that first compress the information of the given inputs x and then are informative regarding t. We define p_θ(z|x), a probabilistic encoder from data points x to the latent variables z, parameterized by a neural network θ.
We define a joint model over t, x and z as p(t, x, z; l) = p_D(x) ∏_{m=1}^M p_θ(z_m|t_{m−1}, x) p_γ(t_m|x_m, z_m; l_m), where p_D(x) is the empirical distribution over the input variables x. Consistent with previous approaches, p_θ(z_m|t_{m−1}, x) is a Gaussian. Similarly, p_γ(t_m|x_m, z_m; l_m) posits a distribution over the semi-supervised latent knowledge t given x and the latent variables z. To learn richer representations, we force z and t to depend on one another in a sawtooth fashion: t_i depends on z_i, but z_i depends on t_{i−1}. Fig. 2 shows an illustration. This is a novel segmented, autoregressive sequencing for event modeling.
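The sawtooth dependence can be sketched as a simple sampling loop. This is a minimal sketch of the control flow only, with stand-in functions (`encode_z`, `propose_t`, and the projection `W` are our placeholders, not the paper's networks):

```python
# Sawtooth sampling sketch: z_m is drawn given the previous frame t_{m-1}
# and the events x; t_m is then drawn given z_m.
import numpy as np

rng = np.random.default_rng(0)
T, d_z, M = 5, 4, 3                        # frame vocab size, latent dim, #events
x = rng.normal(size=(M, d_z))              # placeholder event encodings
W = 0.1 * rng.normal(size=(T, d_z))        # projects a frame into latent space

def encode_z(t_prev, x_all):               # stand-in for p_theta(z_m | t_{m-1}, x)
    mu = x_all.mean(axis=0) + t_prev @ W
    return mu + rng.normal(size=d_z)       # reparameterized Gaussian sample

def propose_t(z):                          # stand-in for r_gamma(t_m | z_m)
    logits = rng.normal(size=T) + z.sum()
    p = np.exp(logits - logits.max())
    return p / p.sum()

t_prev = np.zeros(T)                       # no frame before the first event
frames = []
for m in range(M):
    z_m = encode_z(t_prev, x)              # z_m depends on t_{m-1} and x
    t_m = int(propose_t(z_m).argmax())     # t_m depends on z_m (sawtooth)
    frames.append(t_m)
    t_prev = np.eye(T)[t_m]                # one-hot frame for the next step
```

The loop makes the autoregressive structure explicit: information flows z_1 → t_1 → z_2 → t_2 → …, never directly t_{m−1} → t_m.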
We define r_γ(t|z) as the proposed distribution that relates z to t, where γ is computed as the output of a neural encoding of z, NN(z). This distribution must be learned. For simplicity we omit the node index m when possible.
Revision Phase Incorporating the external knowledge in the training phase must satisfy two criteria. First, without observation (l_m = 0), we just rely on r_γ(t_m|z_m). Second, when we do have access to external knowledge, p_D(t_m|x_m) should be used for guidance rather than discarding the proposed distribution. We define the revised distribution p_γ(t_m|x_m, z_m; l_m) as a smoothed weighted average of the proposed distribution and the empirical distribution, p_γ(t_m|x_m, z_m; l_m) = (r_γ(t_m|z_m) + λ l_m p_D(t_m|x_m)) / (1 + λ l_m), where λ ∈ R+ is a weighting parameter balancing the proposed and empirical distributions. In this setting, λ depends on the level of confidence in our external knowledge: for less noisy knowledge, we can choose higher values of λ. In practice, we found that λ = 1.0 works reasonably well. If side knowledge is absent (l_m = 0), then p_γ(t_m|x_m, z_m; l_m) = r_γ(t_m|z_m): the model just uses the proposed distribution. This allows gradients to propagate through the network.
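A minimal sketch of this revision step, assuming the revised distribution takes the smoothed-average form p_γ = (r_γ + λ l p_D) / (1 + λ l) described above (function name is ours):

```python
# Revision step sketch: mix the proposed distribution r_gamma(t|z) with
# the empirical p_D(t|x), but only when side knowledge is observed (l=1).
import numpy as np

def revise(r, p_d, l, lam=1.0):
    """p_gamma(t|x,z;l) = (r + lam*l*p_d) / (1 + lam*l)."""
    return (r + lam * l * p_d) / (1.0 + lam * l)

r = np.array([0.7, 0.2, 0.1])          # proposed r_gamma(t|z)
p_d = np.array([0.0, 1.0, 0.0])        # one-hot empirical p_D(t|x)
revised = revise(r, p_d, l=1)          # mass shifts toward the observed index
unsupervised = revise(r, p_d, l=0)     # l=0: reduces to r unchanged
```

With λ = 1 and an observation, the revised distribution is simply the average of the proposed and empirical distributions, so the observed frame's probability is boosted without being forced to 1.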
During the test phase we do not use any external knowledge, so the revised distribution p γ t|x, z; l reduces to the proposed distribution r γ (t|z).
Analysis and Insights If we define λ̃_m = 1/(1 + λ l_m), we can rewrite the revised distribution as p_γ(t_m|x_m, z_m; l_m) = λ̃_m r_γ(t_m|z_m) + (1 − λ̃_m) p_D(t_m|x_m). This slight notation change is beneficial, as it lets us characterize the behavior of our revision step in terms of the expected observability of side knowledge. Specifically, in a semi-supervised setting, if the probability of observing any particular piece of side information can be modeled as l ∼ Bern(ε), where ε is the observation probability, then by marginalizing out l we have E_l[p_γ(t|x, z; l)] = (1 − ελ/(1+λ)) r_γ(t|z) + (ελ/(1+λ)) p_D(t|x). This indicates that with more observation, we rely more heavily on the empirical distribution and less on the proposed distribution. For space, see Appendix E for more details and the derivation.
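This marginalization can be checked numerically. The sketch below is our own verification, assuming the smoothed-average form p_γ = (r + λ l p_D)/(1 + λ l) and l ∼ Bern(ε); it confirms that averaging the revised distribution over l gives a mixture with weight ελ/(1+λ) on the empirical distribution:

```python
# Numerical check: E_l[p_gamma(t|x,z;l)] with l ~ Bern(eps) equals
#   (1 - eps*lam/(1+lam)) * r  +  (eps*lam/(1+lam)) * p_D.
import numpy as np

def revise(r, p_d, l, lam):
    return (r + lam * l * p_d) / (1.0 + lam * l)

r = np.array([0.6, 0.3, 0.1])          # proposed distribution
p_d = np.array([0.0, 1.0, 0.0])        # one-hot empirical distribution
eps, lam = 0.8, 1.0                    # observation probability, mixing weight

# Expectation over the two Bernoulli outcomes of l:
expected = eps * revise(r, p_d, 1, lam) + (1 - eps) * revise(r, p_d, 0, lam)

# Closed form predicted by the marginalization:
w = eps * lam / (1 + lam)
closed_form = (1 - w) * r + w * p_d
```

As ε → 1 (full supervision), the empirical weight approaches λ/(1+λ); as ε → 0, the model falls back entirely on the proposed distribution.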
Finally, our framework generalizes SSDVAE's parameter injection (see Appendix F):

Theorem 1. If r_γ(t|z) = Cat(γ), a categorical distribution with parameters γ, and the empirical distribution p_D(t|x) is a one-hot representation with t_{k*} = 1, then the revision step reduces to SSDVAE parameter injection.
We have described how to guide the latent variable t in the encoding phase. Next, we define our objective function to capture the background information and decode t so as to effectively model event descriptions.

Training Objective and Decoding
After training, the model relies solely on the proposed distribution r_γ(t_m|z_m) to predict t_m, implying that the statistics of t_m will depend only on z_m. To capture the background information in t, we minimize I(x; t|z) during training. In information-theoretic terms, in the ideal situation where I(x; t|z) = 0, there is no residual information between x and t that is not captured by the latent representation z (Kirsch et al., 2020). Therefore, without needing x or p_D(t|x), the latent z is enough to predict t.
To be meaningfully learned, t must be informative enough to make some prediction or reconstruction. For clarity, we refer to the targets as y, though for a task like language modeling, x = y. To achieve this learning, we maximize I(y; t).
Together, our ideal objective is L = −I(y; t) + αI(x; t|z), where α is a tunable hyperparameter. This is difficult to optimize because each I term contains intractable marginalizations (such as over x). To understand why this is our ideal objective, we show that maximizing the intractable mutual information I(t; z) is inherently included in this unified objective for reconstruction tasks:

Theorem 2. For tasks where we maximize I(x; t), minimizing I(x; t|z) leads to maximizing I(t; z).

This theorem shows that reconstruction tasks like language modeling explicitly maximize the mutual information between different data representations, which is consistent with the InfoMax principle. See Appendix G for the proof. As a consequence of Theorem 2, we are maximizing the mutual information between two views of x: the compressed representation z and the side information t.
With that understanding, we proceed to a tractable approximation of our objective. Following Alemi et al. (2016), we have −I(y; t) = −E_{p(y,z,t)}[log (p(y|t)/p(y))]. Since p(y|t) is intractable, we approximate it with a decoder q_φ(y|t) computed by a neural network with parameters φ, giving the upper bound −I(y; t) ≤ −E_{p(y,z,t)}[log q_φ(y|t)] − H(y); we denote the first term L_y. As the task entropy H(y) is constant, we just minimize L_y. While not explicitly reflected in Eq. 2, note that irrespective of whether side information is present or not, I(y; t) depends on r_γ(t|z). See Appendix B.1 for additional analysis of Eq. 3.
Updating Phase The term I(x; t|z) = E_{p(x,z,t)}[log (p(t|x,z)/p(t|z))] makes z informative about t. We approximate it with a surrogate objective L_I, which encourages the proposed distribution r_γ(t|z) to be updated to be close to the revised distribution p_γ(t|x, z; l). After training, the proposed distribution r_γ(t|z) plays the role of p_γ(t|z, x) and does not need to explicitly use p_D(t|x). Throughout, we refer to L_I as updating.
The updating objective for our model is given by L_I = Σ_{m=1}^M E[KL(p_γ(t_m|x_m, z_m; l_m) ∥ r_γ(t_m|z_m))]. Expanding further, this sum is effectively computed only over the observed nodes: for unsupervised nodes, the revised and proposed distributions are equal and their KL terms are zero. The last term of the expansion acts as a classification term.
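A minimal numerical sketch of this updating surrogate (our own illustration, with assumed names and the smoothed-average form of the revised distribution):

```python
# Updating surrogate L_I sketch: sum of KL(revised || proposed) over
# events; for unsupervised nodes (l_m = 0) the revised distribution
# equals the proposed one and the KL term vanishes.
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, with a small smoothing eps."""
    return float((p * np.log((p + eps) / (q + eps))).sum())

def updating_loss(proposed, empirical, l, lam=1.0):
    total = 0.0
    for r, p_d, l_m in zip(proposed, empirical, l):
        revised = (r + lam * l_m * p_d) / (1.0 + lam * l_m)
        total += kl(revised, r)
    return total

proposed = [np.array([0.7, 0.2, 0.1]), np.array([0.5, 0.3, 0.2])]
empirical = [np.array([0.0, 1.0, 0.0]), np.array([1.0, 0.0, 0.0])]
loss_sup = updating_loss(proposed, empirical, l=[1, 1])    # positive
loss_unsup = updating_loss(proposed, empirical, l=[0, 0])  # exactly zero
```

Minimizing this loss pulls the proposed distribution toward the revised one wherever side knowledge was observed, which is how the empirical signal reaches r_γ(t|z) without being needed at test time.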
Regularization To improve the model's generalization ability, we introduce regularization terms for z and t. We constrain the mutual information between the data x and the latent representations z, I(x; z) = E_{p(z,x)}[log (p_θ(z|x)/p(z))], with L_z = E_{p_D(x)}[KL(p_θ(z|x) ∥ q(z))], where we introduce a variational distribution q(z) due to the intractability of p(z). For simplicity we assume that q(z) factorizes over independent Gaussian random variables as q(z) = ∏_{m=1}^M q(z_m), where the variational distribution over each z_m is the unit Gaussian, z_m ∼ N(0, I). L_z is estimated with standard Monte Carlo sampling.

We also reduce the distance between the proposed distribution r_γ(t|z) and a uniform distribution U(t) via L_t = E[KL(r_γ(t|z) ∥ U(t))]. The Kullback-Leibler (KL) divergence terms in Eq. 7 and Eq. 9 help avoid overfitting. An alternative interpretation of these regularization terms is that they discard task-irrelevant information.

We combine Eqs. 3, 4, 7 and 9 to arrive at our final objective L = L_y + αL_I + βL_z + ζL_t, where β and ζ denote the trade-off parameters, and can be set empirically, as described in Appendix I.
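The two regularizers can be sketched directly. This is our own illustration (function names are hypothetical), using the closed-form KL to a unit Gaussian for L_z and KL to a uniform distribution for L_t:

```python
# Regularizer sketches:
#   L_z: KL( N(mu, diag(sigma^2)) || N(0, I) ), per-event, closed form.
#   L_t: KL( r || U ) = log T - H(r) for a T-way distribution r.
import numpy as np

def kl_gauss_to_unit(mu, sigma):
    """Closed-form KL between a diagonal Gaussian and the unit Gaussian."""
    return float(0.5 * (mu**2 + sigma**2 - 2.0 * np.log(sigma) - 1.0).sum())

def kl_to_uniform(r, eps=1e-12):
    """KL divergence from a discrete distribution r to the uniform U(t)."""
    T = len(r)
    return float((r * np.log((r + eps) * T)).sum())

L_z = kl_gauss_to_unit(np.zeros(4), np.ones(4))   # encoder == prior: KL is 0
L_t = kl_to_uniform(np.full(5, 0.2))              # already uniform: KL ~ 0
L_t_peaked = kl_to_uniform(np.eye(5)[0])          # one-hot: KL = log 5 > 0
```

Both terms are zero exactly when the distribution being regularized matches its reference, and grow as it concentrates, matching their role as overfitting penalties.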

Architecture For Event Modeling
To ensure fair comparisons, we focus on the reconstruction task, similar to a β-VAE framework (Higgins et al., 2016). The overall structure is depicted in Fig. 2. Following previous work on event modeling (Pichotta and Mooney, 2016; Rezaee and Ferraro, 2021; Weber et al., 2018b; Gao et al., 2022), we represent each document x as a sequence of M events, where each event is a 4-tuple of predicate (verb), two main arguments (subject and object), and a modifier (if applicable). Each event is associated with a discrete semantic frame. E.g., "convicted man of murdering" is an event and its semantic frame is [VERDICT]. All the possible frames are collected in a vocabulary of size T. In this setting, we obtain a point estimate for p_D(t_m|x_m) as δ(x_m, t_m); sampling from this empirical distribution outputs a one-hot vector of dimension T.

The proposed distribution r_γ(t|z) is a Gumbel-Softmax distribution. We found the Gumbel-Softmax algorithm particularly suitable for our task because it can effectively approximate discrete distributions while allowing gradients to backpropagate. While experiments with the Straight-Through Gumbel-Softmax (STGS) yielded near-identical performance to the Gumbel-Softmax method, we opted for the latter. STGS generates one-hot vectors during the forward pass but requires approximating the gradient using Gumbel-Softmax samples. By setting a low temperature of 0.5, the generated Gumbel-Softmax samples become almost identical to one-hot vectors, eliminating the need for gradient approximation. The effects on the gradient are analyzed in Damadi and Shen (2022), which conducted a comprehensive study of gradient properties.

To learn richer representations, we define an embedding matrix E ∈ R^{T×d_t} to convert a simplex frame sample into a vector representation, e_m = t_m^T E.
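The Gumbel-Softmax sampling step can be sketched as follows. This is an illustrative numpy implementation of the standard relaxation (not the paper's code), using the low temperature mentioned above:

```python
# Gumbel-Softmax sampling sketch: perturb logits with Gumbel(0,1) noise,
# divide by a temperature tau, and softmax. At tau = 0.5 the resulting
# points on the simplex are close to one-hot vectors.
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=0.5):
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u))              # Gumbel(0, 1) noise
    y = (logits + g) / tau
    e = np.exp(y - y.max())              # stable softmax
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])
sample = gumbel_softmax(logits)          # a relaxed one-hot on the simplex
```

Because the sample is a differentiable function of the logits, gradients flow through the relaxation without the straight-through estimator's gradient approximation.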
Encoding and Decoding With a data point x ∼ p_D(x), we encode the whole sequence into recurrent hidden representations H = {h_t}_{t=1}^T. For each event m, we draw a Gaussian random variable z_m ∼ N(µ_m, σ_m), where µ_m and σ_m are the outputs of attention layers over the frame embedding e_{m−1} and the hidden states H. We use a linear mapping over z_m to compute the Gumbel-Softmax parameters of the proposed distribution r_γ(t_m|z_m). Given the proposed distribution r_γ(t_m|z_m) and the empirical distribution p_D(t_m|x_m), we first draw a Bernoulli sample λ̃_m, then draw a knowledge sample from the mixture of probabilities, t_m ∼ p_γ(t_m|x_m, z_m; l_m). From the encoding phase, all the embedding vectors are gathered into e = {e_m}_{m=1}^M. At generation time, analogously to the encoder, we use an attention layer over the decoder recurrent hidden state h and the frame embeddings e, resulting in decoder logits g. The generative distribution over possible next tokens is given by x_t ∼ p(x_t|g).

Figure 3: In Fig. 3a we show that the distribution r_γ(t|z) that RevUp learns very closely matches the ground truth. In Fig. 3c, we show how RevUp affects both the proposed and revised distributions. Initially, the r_γ(t|z) prediction is random. After 50 iterations it is closer to the revised distribution p_γ(t|z, x). After 100 iterations of training, both of them get close to the ground-truth distribution.

Experimental Results
Following previous work (Rezaee and Ferraro, 2021), we experiment on event modeling tasks using Concretely Annotated versions of the New York Times Gigaword (NYT) and Wikipedia datasets (Ferraro et al., 2014). Both are English and have FrameNet semantic frames provided via SemaFor (Das et al., 2014). We perform a direct comparison with the previous state-of-the-art work (Rezaee and Ferraro, 2021) by using the same frame types, which were derived from the FrameNet annotations on Wikipedia articles.
The training, validation and test splits for NYT and Wikipedia have 320k/17k/7k and 240k/8k/11k documents, respectively. For both the validation and test phases we set l_m = 0 (unsupervised). We average results over three runs (standard deviations are in the appendix). We tuned the hyperparameters on the validation dataset. See Appendix I for additional implementation and data details.

Small Dataset Example
We first examine RevUp's behavior on a small, focused example dataset. We sampled 400 newswire documents from the NYT dataset.
We trained a RevUp model with 10 semantic frame types (10 options for each t) and where each z was 100-dimensional. To obtain the ground-truth distributions of frames given events, we focus on the predicates and collect all the semantic frames for each specific predicate. We carefully selected the frame types and data to reflect a diverse range of difficulties. For example, events with the "did" predicate are always associated with the INTENTIONALLY_ACT frame, while the corresponding semantic frames for the "gets" predicate are ARRIVING, BECOMING and GETTING. RevUp predictions are acquired from the proposed distribution r_γ(t|z): given an event x, we first draw a Gaussian sample z and then take the argmax over t to find the proposed index. Finally, we normalize across all the proposed frames. We visualize the revision and update steps for the predicate "made" in Fig. 3c.
As evidenced by Eq. 4, minimizing the KL-divergence between the proposal distribution $r_\gamma(t|z)$ and the revised distribution $p_\gamma(t|x, z)$ eliminates non-essential information. Fig. 3c gives a visual representation of this phenomenon. At iteration 0, the proposal distribution $r_\gamma(t|z)$ is almost uniform, while the revised distribution $p_\gamma(t|x, z)$ is more aligned with the ground-truth distribution. With additional iterations, reaching 100, the gap between these distributions shrinks, and both ultimately converge to the ground truth.
Additionally, our framework effectively captures the conditional distribution of frames given predicates in diverse scenarios, as demonstrated in the heatmaps in Figs. 3a and 3b. For instance, certain predicates such as "did," "says," and "knew" have a single associated frame of "intentionally_act," "awareness," and "statement," respectively. Meanwhile, the predicate "made" has two possible semantic frames, "arriving" and "causation." In some cases, such as the predicate "get," there are even three possible frames: "arriving," "becoming," and "getting." The comparison between the ground-truth distribution of frames given predicates and the normalized samples highlights the accuracy of our model in capturing these conditional distributions, with minimal error.

Baselines
We compare the proposed RevUp method with the following event modeling approaches. (a) RNNLM (Pichotta and Mooney, 2016): a bidirectional GRU with two layers, hidden dimension 512, gradient clipping at 5, and GloVe 300 embeddings to represent words. We used the implementation provided by Weber et al. (2018b).

Effect of Noisy Knowledge
We empirically compare the predictive performance of RevUp and SSDVAE under noisy knowledge. To do so, in a fully supervised setting, for each event $x_m$, instead of using the associated semantic frame $t_m$, with probability $\eta$ we replace $t_m$ with a random semantic frame $\tilde{t}_m$. We train both models on this noisy training dataset. During the testing phase, we compare the predicted knowledge with the ground truth. The results validate the effectiveness of our information-capturing strategy when the external knowledge is noisy. As we increase $\eta$, SSDVAE's performance degrades much more than RevUp's. For instance, when $\eta = 0.9$, SSDVAE's accuracy is almost zero, but our proposed model still achieves 0.41.
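The corruption procedure is simple to state precisely. The sketch below (with a toy frame vocabulary) replaces each gold frame with a uniformly random one with probability eta:

```python
import random

def corrupt_frames(frames, frame_vocab, eta, seed=0):
    """With probability eta, replace each gold frame t_m with a random frame."""
    rng = random.Random(seed)
    noisy = []
    for t in frames:
        noisy.append(rng.choice(frame_vocab) if rng.random() < eta else t)
    return noisy

# Toy example: frame labels are illustrative, not the full FrameNet inventory.
gold = ["TRANSACTION", "ARRIVING", "STATEMENT", "AWARENESS"]
vocab = ["TRANSACTION", "ARRIVING", "STATEMENT", "AWARENESS", "GETTING"]
assert corrupt_frames(gold, vocab, eta=0.0) == gold   # eta = 0 keeps gold labels
noisy = corrupt_frames(gold, vocab, eta=1.0)          # eta = 1 resamples every label
```

Note that even at eta = 1.0 a resampled label can coincide with the gold one by chance, which matches the experimental setup of replacement with a uniformly random frame.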

Effect of Incomplete Knowledge
We consider the case of semi-supervised learning, where each node $t_m$ is observed with some fixed probability. We report the classification quality on the latent nodes via $\mathrm{KL}\big[p_D(t|x)\,\|\,r_\gamma(t|z)\big] \approx -\mathbb{E}_{p_D(x)p_\theta(z|x)p_D(t|x)}\log r_\gamma(t|z)$. The results are shown in Fig. 4. We report two widely used classification metrics, accuracy and F1, to evaluate the performance of all methods. SSDVAE relies only on guidance and classification, whereas RevUp also uses the updating phase to shift the available knowledge from side information into latent variables. The results thus demonstrate the superiority of RevUp at predicting knowledge when it is partially observed, attributable to its novel information injection and learning.
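The Monte Carlo form of that KL term can be sketched directly. Everything below (the toy encoder, the toy proposal network, the event set) is a hypothetical stand-in; the sketch only shows the sampling scheme $x \sim p_D(x)$, $z \sim p_\theta(z|x)$, $t \sim p_D(t|x)$, then averaging $-\log r_\gamma(t|z)$:

```python
import numpy as np

rng = np.random.default_rng(2)

n_frames = 4

def p_theta_z_given_x(x):
    """Toy encoder mean for p_theta(z|x); z is 3-dimensional here."""
    return np.full(3, float(x))

def r_gamma(z):
    """Toy proposal network producing r_gamma(t|z) via a softmax."""
    logits = np.array([z.sum(), 0.0, -1.0, 1.0])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def neg_cross_entropy_term(xs, p_D_t_given_x, n_samples=50):
    """Estimate -E_{p_D(x) p_theta(z|x) p_D(t|x)} log r_gamma(t|z)."""
    total = 0.0
    for _ in range(n_samples):
        x = xs[rng.integers(len(xs))]                   # x ~ p_D(x)
        z = p_theta_z_given_x(x) + rng.normal(size=3)   # z ~ p_theta(z|x)
        t = rng.choice(n_frames, p=p_D_t_given_x[x])    # t ~ p_D(t|x)
        total += -np.log(r_gamma(z)[t])
    return total / n_samples

est = neg_cross_entropy_term(
    xs=[0, 1],
    p_D_t_given_x={0: np.array([0.7, 0.1, 0.1, 0.1]),
                   1: np.array([0.1, 0.1, 0.1, 0.7])})
```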
To investigate whether the proposed approach improves task representation, we compare the perplexity of our approach to prior work in Table 2. Perplexity is commonly used in the literature, which allows a fair comparison with previous efforts. We investigate the effect of supervision with observation probabilities of 0.0 (unsupervised), 0.1, 0.7, and 1.0 (fully supervised) on the samples generated from the model. Our model obtains lower perplexity scores than the previous event modeling methods, and its performance improves as the observation probability increases. For each observation probability, our method outperforms SSDVAE. The results demonstrate that our method achieves state-of-the-art performance by large margins over the baselines.

Related Work
Information Bottleneck. The concept of using side information for discrete source coding was explored by Wyner (1975) and Wyner and Ziv (1976). The Information Bottleneck (IB) principle was then introduced by Tishby et al. (2000) to compress input variables while predicting a target. Chechik and Tishby (2002) proposed incorporating negative side information through an auxiliary loss in a supervised manner; our method stands apart by handling both supervised and semi-supervised settings. The Variational Information Bottleneck (VIB) was introduced by Alemi et al. (2016), which improved IB estimation through amortized variational methods. Voloshynovskiy et al. (2020) extended VIB to semi-supervised classification. In relation to our work, numerous studies have focused on maximizing the mutual information between different views while discarding non-shared information (Federici et al., 2019; Wan et al., 2021; Wang et al., 2019; Mao et al., 2021; Yan et al., 2019).

Event Modeling. In recent years, much research has been dedicated to modeling sequences of events (Weber et al., 2018b; Rezaee and Ferraro, 2021; Gao et al., 2022). Weber et al. (2018a) introduced a tensor-based composition to effectively capture semantic event relations. Gao et al. (2022) proposed a self-supervised contrastive learning approach based on co-occurring events. Weber et al. (2018b) used recurrent neural networks (RNNs) to model the hierarchical latent structure in sequences of events. This methodology was further developed by Rezaee and Ferraro (2021), who incorporated external knowledge into the latent layer for enhanced modeling, which is highly relevant to our proposed approach.

Conclusion
We show how to incorporate a noisy, partially observed side knowledge source along with latent variables. To do so, we generalized the main ideas of parameter injection and maximizing the mutual information between external knowledge and latent variables. Our experiments show that our approach is more robust to noisy knowledge and outperforms other baselines on the event modeling task.

Acknowledgments

This work is in part supported by the Army Research Laboratory, Grant No. W911NF2120076, and by the Air Force Research Laboratory (AFRL), DARPA, for the KAIROS program under agreement number FA8750-19-2-1003. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either express or implied, of the Air Force Research Laboratory (AFRL), DARPA, or the U.S. Government.

Limitations
One limitation of our work is that multiple hyperparameters must be tuned for each task. While we provide guidance and insights into effective settings (and some intuition as to why they work), we acknowledge that the settings may be domain dependent.
Because we use semantic frames as our side knowledge, our focus is on improving the representation and use of discrete latent variables. While current NLP approaches have often focused on text-to-text methods for input and output, and individual words in text can be considered a form of discrete latent variable, we note that these methods are driven by large continuous embedding methods. While we believe this work can be extended to continuous cases, our approximations made use of properties of discrete variables and would need to be re-derived.
RevUp depends on mixing in statistics and learned representations from external side knowledge.While we envision this side knowledge as containing useful generalizations and semantic information, such resources could encode overly broad generalizations or other biases.While the degree of this mixture can be adjusted, imperfections or biases in the external knowledge could be captured and propagated through RevUp.
We focus on the task of event modeling, but we believe RevUp represents a step towards improving settings where noisy side information is available.
where for simplicity we assume that $q(z)$ factorizes over independent random variables as $q(z) = q(z_1, z_2, \ldots, z_M) = \prod_{m=1}^{M} q(z_m)$. We parameterize each $z_m$ via a multivariate Normal, $q(z_m) = \mathcal{N}(0, I)$ and $p_\theta(z_m \mid x) = \mathcal{N}\big(f_\mu(x), f_\Sigma(x)\big)$, where both $f_\mu$ and $f_\Sigma$ are neural networks. With this formulation, we can calculate the KL-divergence terms of Eq. 15 in closed form.

B.1 Approximating I(y; t) (Eq. 3)

In the same way, for the task representation we have
$$-I(y;t) = -\mathbb{E}_{p(y,z,t)} \log \frac{p(y|t)}{p(y)} \le -\mathbb{E}_{p(y,z,t)} \log \frac{q_\phi(y|t)}{p(y)} = -\mathbb{E}_{p(y,z,t)} \log q_\phi(y|t) - H(y),$$
where $H(y)$ is constant and we ignore it. We have
$$\mathbb{E}_{p(y,z,t)} \log q_\phi(y|t) = \mathbb{E}_{p_D(x,y)\, p_\theta(z|x)\, p_\gamma(t|z)} \log q_\phi(y|t),$$
where we first sample $x, y \sim p_D(x,y)$, then $z^{(s_z)} \sim p_\theta(z|x)$, and finally $t^{(s_t)} \sim r_\gamma(t|z^{(s_z)})$. In this approximation, $S_z$ and $S_t$ are the total numbers of samples for $z$ and $t$, respectively.

We wish to clarify why this term is included in the objective and how optimizing it affects the model parameters. In computing $I(y;t)$, the distribution used to compute the mutual information's expectation depends on the proposal distribution $r_\gamma$. This is by construction and irrespective of whether side information is present. This is where the model parameters come in.
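The closed-form Gaussian KL terms mentioned above are standard; for a diagonal posterior against the $\mathcal{N}(0, I)$ prior, a minimal sketch is:

```python
import numpy as np

def kl_to_standard_normal(mu, sigma2):
    """Closed-form KL( N(mu, diag(sigma2)) || N(0, I) ):
    0.5 * sum(sigma2 + mu^2 - 1 - log sigma2)."""
    return 0.5 * np.sum(sigma2 + mu**2 - 1.0 - np.log(sigma2))

# The KL is zero exactly when the posterior equals the prior q(z_m) = N(0, I).
assert kl_to_standard_normal(np.zeros(4), np.ones(4)) == 0.0
```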
Specifically, while $I(y;t)$ is defined with respect to the joint distribution over $y$ and $t$, in our model that joint involves the term $p_\gamma(t_m|x_m, z_m; l_m)$, which is defined as the interpolation of a proposal distribution $r_\gamma(t_m|z_m)$ and an empirical distribution over $t_m|x_m$. Defined via a Gumbel-Softmax, this proposal distribution explicitly allows uncertainty. Therefore, even when side information is present, we sample the value of $t_m$ to approximate that final expectation. Finally, as mentioned in the paper, this last line is intractable, so we learn an approximation $q_\phi(y|t)$, which introduces additional model parameters to learn. We compute $q_\phi$ using the sampled values $t_1, \ldots, t_M$.
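The Gumbel-Softmax sampling that underlies the proposal can be sketched as follows (a generic illustration of the trick, with an arbitrary logit vector, not the paper's trained proposal network):

```python
import numpy as np

rng = np.random.default_rng(3)

def gumbel_softmax(logits, tau=0.5):
    """Relaxed categorical sample via the Gumbel-Softmax trick;
    as tau -> 0 the sample approaches a one-hot argmax draw."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1) noise
    y = (logits + g) / tau
    y = y - y.max()
    e = np.exp(y)
    return e / e.sum()

sample = gumbel_softmax(np.array([2.0, 0.5, -1.0]), tau=0.1)
hard_t = int(np.argmax(sample))   # the sampled frame index t_m
```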

C Update Step Upperbound
In this section, we show how we approximate the update term $I(t; x|z)$ in Eq. 4. We first expand the term, and then show how we approximate the upper bound presented in Eq. 24.
$$\mathbb{E}_{p_\gamma(t,z,x)} \log \frac{p_\gamma(t|z,x)}{r_\gamma(t|z)} = \underbrace{\mathbb{E}_{p_D(x)\, p_\theta(z|x)\, p_\gamma(t|z,x)} \log p_\gamma(t|z,x)}_{A_1} - \underbrace{\mathbb{E}_{p_D(x)\, p_\theta(z|x)\, p_\gamma(t|z,x)} \log r_\gamma(t|z)}_{A_2}$$

[Figure: The building blocks of RevUp. The revision of side information t forms a new distribution by combining the proposal with the empirical distribution, $p_\gamma(t|x,z) = \hat{\lambda}\, r_\gamma(t|z) + (1-\hat{\lambda})\, p_D(t|x)$; the panels illustrate the Revise and Update steps, with unobserved knowledge sampled.]

(c) The proposed and revised distributions converge to the ground truth distribution.
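The revise step itself, $p_\gamma(t|x,z) = \hat{\lambda}\, r_\gamma(t|z) + (1-\hat{\lambda})\, p_D(t|x)$, is a convex combination of two categorical distributions and can be sketched with toy inputs:

```python
import numpy as np

def revise(r_t_given_z, p_D_t_given_x, lam):
    """Revised distribution: lam * r_gamma(t|z) + (1 - lam) * p_D(t|x)."""
    return lam * r_t_given_z + (1.0 - lam) * p_D_t_given_x

r = np.array([0.25, 0.25, 0.5])   # toy proposal r_gamma(t|z)
p_D = np.array([0.0, 1.0, 0.0])   # toy empirical distribution (observed frame)
p_rev = revise(r, p_D, lam=0.3)   # pulled toward the observed side knowledge
```

At lam = 1 the revised distribution ignores the side knowledge entirely; at lam = 0 it copies it.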

Figure 3 :
Figure 3: A demonstration of RevUp working on a small dataset with 10 frames. Ground truth frame distributions are shown in Fig. 3b; in Fig. 3a we show that the distribution $r_\gamma(t|z)$ that RevUp learns very closely matches the ground truth. In Fig. 3c, we show how RevUp affects both the proposed and revised distributions. Initially, the $r_\gamma(t|z)$ prediction is random. After 50 iterations it is closer to the revised distribution $p_\gamma(t|z, x)$. After 100 iterations of training, both of them get close to the ground truth distribution.

Figure 4 :
Figure 4: RevUp vs. SSDVAE accuracy for sequential semantic frame classification on the NYT and Wikipedia datasets. See Appendix H for full results. For all observation probabilities, RevUp outperforms the baseline.

Table 1 :
Effect of noise on robustness. We present standard deviations in Appendix H.

Results in Table 1 show that classification and parameter injection in SSDVAE are not enough to capture knowledge.

Table 2 :
Test perplexity results, varying the percentage of side knowledge observed during training. We present standard deviations in Appendix H (Table 4).
$$I(t; x|z) = \mathbb{E}_{p_\gamma(t,z,x)} \log \frac{p_\gamma(t, x|z)}{p(t|z)\, p(x|z)} \tag{23}$$
First, we should note that in Eq. 23 the distribution $p(t|z)$ is not equal to $r_\gamma(t|z)$, because for each node $m$ we have
$$\begin{aligned}
p(t_m|z_m) &= \int p(t_m, x_m|z_m)\, dx_m = \int p(t_m|x_m, z_m)\, p(x_m|z_m)\, dx_m \\
&= \hat{\lambda}_m \int r_\gamma(t_m|z_m)\, p(x_m|z_m)\, dx_m + (1 - \hat{\lambda}_m) \int p_D(t_m|x_m)\, p(x_m|z_m)\, dx_m \\
&\neq r_\gamma(t_m|z_m).
\end{aligned}$$
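This inequality is easy to confirm numerically in a discrete toy setting, where the integral over $x_m$ becomes a sum. All probabilities below are made up for illustration:

```python
import numpy as np

# Discrete toy check that p(t|z) != r_gamma(t|z): marginalizing the revised
# distribution over x leaves a (1 - lam) term that depends on p_D(t|x).
lam = 0.6
r = np.array([0.2, 0.3, 0.5])                 # toy r_gamma(t|z)
p_x_given_z = np.array([0.5, 0.5])            # two possible events x
p_D = np.array([[1.0, 0.0, 0.0],              # toy p_D(t|x) for each event
                [0.0, 1.0, 0.0]])

# p(t|z) = lam * r(t|z) + (1 - lam) * sum_x p_D(t|x) p(x|z)
p_t_given_z = lam * r + (1 - lam) * (p_x_given_z @ p_D)
assert np.isclose(p_t_given_z.sum(), 1.0)
assert not np.allclose(p_t_given_z, r)        # hence p(t|z) != r_gamma(t|z)
```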