Neuro-Symbolic Reinforcement Learning with First-Order Logic

Deep reinforcement learning (RL) methods often require many trials before convergence, and they provide no direct interpretability of the trained policies. To achieve fast convergence and an interpretable policy in RL, we propose a novel RL method for text-based games built on a recent neuro-symbolic framework called the Logical Neural Network (LNN), which can learn symbolic and interpretable rules in a differentiable network. The method first extracts first-order logical facts from the text observation and an external word-meaning network (ConceptNet), and then trains a policy in the network with directly interpretable logical operators. Our experimental results show that RL training with the proposed method converges significantly faster than other state-of-the-art neuro-symbolic methods on a TextWorld benchmark.


Introduction
Deep reinforcement learning (RL) has been successfully applied to many applications, such as computer games, text-based games, and robot control (Mnih et al., 2015; Narasimhan et al., 2015; Kimura, 2018). However, these methods require many training trials to converge to the optimal action policy, and the trained action policy is not understandable to human operators. This is because, although the training results are sufficient, the policy is stored in a black-box deep neural network. These issues become critical when a human operator wants to solve a real-world problem and verify the trained rules. If the trained rules are understandable and modifiable, the human operator can control them and design action restrictions. A symbolic (logical) representation of the stored rules is well suited to interpretability and quick training, but logical rules are difficult to train with a traditional training approach.
Figure 1: Overview of the proposed method. The agent takes a text observation from the environment, and first-order logical facts are extracted by an FOL converter that uses a semantic parser, ConceptNet, and history. The weights of the network (shown by line thickness in the figure) are updated by these extracted predicate logics. Solid lines show one trained rule: when the agent finds a direction x and the direction x has not been visited, the agent takes a "Go x" action. Dashed lines show the initial connections before training. Example observation shown in the figure: "-= Studio =- I am required to announce that you are now in the studio. You don't like doors? Why not try going north, that entranceway is unblocked. You don't like doors? Why not try going south, that entranceway is unguarded."
To train logical rules, a recent neuro-symbolic framework called the Logical Neural Network (LNN) (Riegel et al., 2020) has been proposed to simultaneously provide key properties of both neural networks (learning) and symbolic logic (reasoning). The LNN can train symbolic rules with logical functions inside the neural network by using an end-to-end differentiable network that minimizes a contradiction loss. Every neuron in the LNN corresponds to a formula of weighted real-valued logic, namely a logical conjunction, disjunction, or negation node, so the network can calculate the probability and the logical contradiction loss during inference and training. At the same time, logical rules can be extracted from the trained LNN by selecting the highly weighted connections, which represent the important rules for an action policy.
In this paper, we propose an action knowledge acquisition method that uses the neuro-symbolic LNN framework for the RL algorithm. Through experiments, we demonstrate the advantages of the proposed method on real-world problems, as opposed to logically grounded games such as Blocks World. Since natural language observations are easier to convert into logical information than visual or audio observations, we tackle text-based interaction games to verify the proposed method. Figure 1 shows an overview of our method. The observation text is input to a semantic parser to extract the logical value of each propositional logic. In this case, the semantic parser finds that there are two exits (north and south). The method then converts the propositional logics and the category of each word into first-order logical (predicate) facts, such as ∃x ∈ {south, north}, ⟨find x⟩ = True and ∃x ∈ {east, west}, ⟨find x⟩ = False. These extracted predicate logics are fed into an LNN that has several conjunction gates and one disjunction gate. The LNN trains the weights of these connections with the reward value to obtain the action policy. The contributions of this paper are as follows.
• The paper describes the design and implementation of a novel neuro-symbolic RL method for text-based interaction games.
• The paper explains an algorithm that extracts first-order logical facts from a given textual observation by using the agent history and ConceptNet as external knowledge.
• In an ablation study on text-based games, we observed that the proposed method converges faster and is more interpretable than state-of-the-art methods and baselines.

Related work
Various prior works have examined RL for text-based games. LSTM-DQN (Narasimhan et al., 2015) is an early study that uses an LSTM-based encoder for feature extraction from the observation and Q-learning for the action policy. LSTM-DQN++ extended the exploration, and LSTM-DRQN added memory units to the action scorer. KG-DQN (Ammanabrolu and Riedl, 2019) and GATA (Adhikari et al., 2020) extended the language understanding. LeDeepChef (Adolphs and Hofmann, 2020) used recurrent feature extraction along with A2C (Mnih et al., 2016). CREST (Chaudhury et al., 2020) was proposed for pruning observation information. TWC (Murugesan et al., 2021) was proposed for utilizing common sense reasoning. However, none of these studies used a neuro-symbolic approach.
Among recent neuro-symbolic RL work, the Neural Logic Machine (NLM) (Dong et al., 2018) was proposed as a method that combines a deep neural network with symbolic logic reasoning. It uses a sequence of multi-layer perceptron layers to deduce symbolic logics. Rules are combined or separated during forward propagation, and the output of the entire architecture represents complicated rules. In this paper, we compare our method with the NLM.

Proposed method

Problem formulation
As text-based games are sequential decision-making problems, RL can naturally be applied to them. These games are partially observable Markov decision processes (POMDP) (Kaelbling et al., 1998), where the observation text does not include the entire information of the environment. Formally, the game is a discrete-time POMDP defined by ⟨S, A, T, R, ω, O, γ⟩, where S is a set of states (s_t ∈ S), A is a set of actions, T is a set of transition probabilities, R is a reward function, ω is a set of observations (o_t ∈ ω), O is a set of conditional observation probabilities, and γ is a discount factor. Although the state s_t contains the complete internal information, the observation o_t does not. In this paper, we make the following two assumptions: first, each word in a command is taken from a fixed vocabulary V, and second, each action command consists of two words (verb and object). The objective of the agent is to maximize the expected discounted reward E[∑_t γ^t r_t].
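The objective above can be made concrete with a small sketch. The function below computes the discounted return ∑_t γ^t r_t for one finished episode; the function name, the example reward sequence, and the value γ = 0.9 are illustrative assumptions, not settings taken from the experiments.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma^t * r_t for one episode, with t starting at 0."""
    total = 0.0
    # Walk backwards so each step is one multiply-add:
    # G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: the only reward arrives on the third step.
episode_rewards = [0.0, 0.0, 1.0]
print(discounted_return(episode_rewards, gamma=0.9))
```

Because later rewards are multiplied by higher powers of γ, an agent that reaches the same reward in fewer steps obtains a larger return.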

Method
The proposed method consists of two processes: converting text into first-order logic (FOL), and training the action policy in LNN.

FOL converter
The FOL converter converts a given natural language observation text o_t and the observation history (o_{t−1}, o_{t−2}, ...) into first-order logic facts. The method first converts the text o_t into propositional logics l_{i,t} with a semantic parser; for example, the agent identifies which directions are open from the current room. The agent then retrieves the class type c of each word in the propositional logic l_{i,t} by using ConceptNet (Liu and Singh, 2004) or another network of word definitions. For example, "east" and "west" are classified as direction-type, and "coin" is classified as money-type. The class is used to select the appropriate LNN for FOL training and inference.
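The conversion steps can be sketched as follows. This is a toy illustration only: the keyword matcher stands in for the semantic parser, the hard-coded `WORD_CLASS` dictionary stands in for a ConceptNet lookup, and all names (`WORD_CLASS`, `parse_found_words`, `to_fol_facts`) are hypothetical.

```python
import re
from typing import Dict, Set, Tuple

# Hypothetical stand-in for a ConceptNet query: word -> class type.
WORD_CLASS: Dict[str, str] = {
    "north": "direction", "south": "direction",
    "east": "direction", "west": "direction",
    "coin": "money",
}

Facts = Dict[str, Dict[Tuple[str, str], bool]]


def parse_found_words(observation: str) -> Set[str]:
    """Toy parser: a known word counts as found if it appears in the text."""
    tokens = set(re.findall(r"[a-z]+", observation.lower()))
    return tokens & set(WORD_CLASS)


def to_fol_facts(observation: str, visited_directions: Set[str]) -> Facts:
    """Group facts by class type: facts[class][(predicate, word)] -> bool."""
    found = parse_found_words(observation)
    facts: Facts = {}
    for word, cls in WORD_CLASS.items():
        class_facts = facts.setdefault(cls, {})
        class_facts[("find", word)] = word in found
        # The visited predicate comes from the history and only
        # applies to direction-type words.
        if cls == "direction":
            class_facts[("visited", word)] = word in visited_directions
    return facts


# Example observation from Figure 1; the agent is assumed to have
# entered the studio from the south.
obs = ("-= Studio =- I am required to announce that you are now in the "
       "studio. You don't like doors? Why not try going north, that "
       "entranceway is unblocked. You don't like doors? Why not try "
       "going south, that entranceway is unguarded.")
facts = to_fol_facts(obs, visited_directions={"south"})
print(facts["direction"])
```

Grouping the facts by class type mirrors the role of the class c in the text: each group can be routed to the LNN that handles that type of word.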

LNN training
The LNN training component obtains an action policy from the given FOL facts. The LNN (Riegel et al., 2020) has logical conjunction (AND), logical disjunction (OR), and negation (NOT) nodes directly in its neural network. In our method, we prepare an AND-OR network for training arbitrary rules from the given inputs. As shown in Fig. 1, we place all logical facts at the first layer, several AND gates (as many as the network requires) at the second layer, and one OR gate connected to all the preceding AND gates. During training, the reward value is used for adding new AND gates and for updating the weight of each connection. More specifically, the method maintains a replay buffer that stores the current observation o_t, action a_t, reward r_t, and next observation o_{t+1}. At each training step, the method samples some replays and extracts first-order logical facts from the current observation o_t and action a_t. The LNN is trained on these fact inputs and the reward: it forwards the input facts through the LNN, calculates a loss value from the reward, and optimizes the weights in the LNN. The whole training mechanism is similar to DQN (Mnih et al., 2013); the difference lies in the network. To aid the interpretability of node values, we define a threshold α ∈ [
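The forward pass of such an AND-OR network, and the read-out of a rule from its weights, can be sketched as below. The sketch assumes weighted Łukasiewicz-style operators in the spirit of Riegel et al. (2020); the text here does not specify the exact activation, bias, or loss, so the operator forms, the bias `beta`, the fact names, and the `weight_threshold` used for rule extraction are all assumptions. No training step is shown.

```python
from typing import Dict, List


def clamp(v: float) -> float:
    """Clip a real value to the truth-value range [0, 1]."""
    return max(0.0, min(1.0, v))


def weighted_and(xs: List[float], ws: List[float], beta: float = 1.0) -> float:
    """Assumed weighted Lukasiewicz-style conjunction."""
    return clamp(beta - sum(w * (1.0 - x) for x, w in zip(xs, ws)))


def weighted_or(xs: List[float], ws: List[float], beta: float = 1.0) -> float:
    """Assumed weighted Lukasiewicz-style disjunction."""
    return clamp(1.0 - beta + sum(w * x for x, w in zip(xs, ws)))


def and_or_forward(facts: Dict[str, float],
                   and_gates: List[Dict[str, float]],
                   or_weights: List[float]) -> float:
    """Facts -> AND gates -> single OR gate, as in the AND-OR layout."""
    names = sorted(facts)
    xs = [facts[n] for n in names]
    and_values = [
        weighted_and(xs, [gate.get(n, 0.0) for n in names])
        for gate in and_gates
    ]
    return weighted_or(and_values, or_weights)


def extract_rule(gate: Dict[str, float], weight_threshold: float) -> List[str]:
    """Keep only the highly weighted inputs of one AND gate."""
    return sorted(n for n, w in gate.items() if w >= weight_threshold)


# Hypothetical weights for one "go x" gate; a negated fact is passed in
# as its own input (1 - visited).
go_gate = {"find": 1.0, "not_visited": 1.0, "initial": 0.05}
unvisited_exit = {"find": 1.0, "not_visited": 1.0, "initial": 0.0}
visited_exit = {"find": 1.0, "not_visited": 0.0, "initial": 0.0}

print(and_or_forward(unvisited_exit, [go_gate], [1.0]))
print(and_or_forward(visited_exit, [go_gate], [1.0]))
print(extract_rule(go_gate, weight_threshold=0.5))
```

With these illustrative weights, the gate fires for a found, unvisited direction and stays off for a visited one, and the extracted rule keeps only the `find` and `not_visited` inputs, which matches the "Go x" rule described in the caption of Figure 1.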

Experiments
We evaluated the proposed method on a coin-collector game in TextWorld with three different difficulties (easy, medium, and hard). The objective of the game is to find and collect a coin that is placed in one of a set of connected rooms. Since we tackle a real-world game problem rather than a symbolic game, we need to extract logical facts from the given natural text for the neuro-symbolic methods. We prepared the following propositional logics as extracted logical facts: which object is found in the observation, which direction has already been visited, and which direction the agent initially came from. These logical values are easily calculated from the visited-room history and word definitions. In this experiment, we prepared 26 logical values, and all the following neuro-symbolic methods used these values as input. For the evaluation metrics, we focused on (1) the test reward on the unseen (test) games and (2) the number of steps needed to achieve the goal on unseen games. Since we focus on generalization performance, we used only 50 small-size (level = 5) games for training, 50 unseen games from 5 different sizes (level = 5, 10, 15, 20, 25) for testing, and mini-batch training (batch size = 4). The other parameters for the game and agent follow LSTM-DQN++ (Narasimhan et al., 2015).

Table 1: Average reward and number of steps (reward: higher is better / number of steps: lower is better) for each epoch on 50 unseen games with three difficulty levels. These results are from a moving average (N = 100) over 5 random seeds. Training is done only on small-size games. Although the neuro-only method cannot solve the unseen test games, our proposed method (FOL-LNN) can solve them and converges much faster than the other state-of-the-art methods and baselines.
* (Narasimhan et al., 2015)
** State-of-the-art neuro-symbolic method with the same input as ours and the other neuro-symbolic methods (Dong et al., 2018)

We prepared five methods for the evaluation of the proposed method:
• LSTM-DQN++ (Narasimhan et al., 2015): State-of-the-art neuro-only method with a simple DQN action scorer. We use this method as the baseline for the neuro-only agent; its LSTM receives embedding vectors extracted from the natural text information.
• NLM-DQN (Dong et al., 2018): State-of-the-art neuro-symbolic method. The input is the propositional logical values that are also used in the following baselines and the proposed method. The original NLM uses the REINFORCE (Williams, 1992) algorithm, but in order to handle text-based games with the same setting as the other methods, we applied the DQN algorithm. In short, the method uses an NLM layer instead of an LSTM (Hochreiter and Schmidhuber, 1997) as the encoder of the LSTM-DQN++ method. We tuned the hyper-parameters over the same search space as the original paper.
• NN-DQN: Naïve neuro-symbolic baseline method. The input of the network is the propositional logical values, and it uses a multi-layer perceptron as the encoder of LSTM-DQN++.
• LNN-NN-DQN: Neuro-symbolic baseline method. The method first obtains the propositional logical values, converts them with an LNN into conjunction values for all combinations of the given logical values, and then inputs these into a multi-layer perceptron. It differs from NN-DQN in that LNN-NN-DQN has prepared conjunction nodes, which should lead to faster training at the beginning of training and better interpretability after training.
• FOL-LNN: Our neuro-symbolic method.
Table 1 shows the test reward and test step values on unseen games, and Fig. 2 shows the corresponding curves. First, all the RL results with logical input were better than those with textual input. Second, our proposed method converged much faster than the other neuro-symbolic state-of-the-art and baseline methods. Third, only our method could extract the trained rules, which is done by checking the weight values of the LNN. We extracted the trained rules from the medium-level games; in these rules, W_direction is the set of words of type "direction" in ConceptNet. The rule for the "take" action is for taking a coin. The first conjunction rule for the "go" action is for visiting an unvisited room, and the second rule is for returning to the initial room from a dead end. These trained rules show how the proposed method can help in operating the neural agent in real use cases.

Conclusion
In this paper, we proposed a novel neuro-symbolic method for RL on text-based games. According to the evaluation on a natural language text-based game with several difficulties, our method converges much faster than other state-of-the-art neuro-only and neuro-symbolic methods, and it can extract the trained logical rules to improve the interpretability of the model.

Discussion about ethics
Our model is not used in any sensitive context such as legal or health-care settings. The data set used in our experiment does not contain any sensitive information. Since our proposed neuro-symbolic RL method can extract the trained rules for interpretability of the model, it can be used to analyze the reason behind a taken action. We believe that if the model returns biased results, this functionality helps clarify the cause of such data bias issues.