Transfer Learning with Synthetic Corpora for Spatial Role Labeling and Reasoning

Recent research shows that synthetic data as a source of supervision helps pretrained language models (PLMs) transfer to new target tasks/domains. However, this idea is less explored for spatial language. We provide two new data resources on multiple spatial language processing tasks. The first dataset is synthesized for transfer learning on spatial question answering (SQA) and spatial role labeling (SpRL). Compared to previous SQA datasets, we include a larger variety of spatial relation types and spatial expressions. Our data generation process is easily extendable with new spatial expression lexicons. The second is a real-world SQA dataset with human-generated questions built on an existing corpus with SpRL annotations. This dataset can be used to evaluate spatial language processing models in realistic situations. We show that pretraining with automatically generated data significantly improves the SOTA results on several SQA and SpRL benchmarks, particularly when the training data in the target domain is small.

The synthetic datasets often focus on specific types of relations with small coverage of the spatial semantics needed for spatial language understanding in various domains. Figure 2 indicates the coverage of sixteen spatial relation types (in Table 1) collected from existing resources (Randell et al., 1992; Wolter, 2009; Renz and Nebel, 2007). The human-generated datasets, despite helping study the problem as evaluation benchmarks, are less helpful for training models that can reliably understand spatial language due to their small size (Mirzaee et al., 2021).

Figure 1: Two new datasets on SQA. (a) SPARTUN, a large synthetic dataset provided as a source of supervision (e.g., a story describing boxes and fruits with questions such as "Is the yellow apple to the west of the yellow watermelon?" and "Where is box two relative to the yellow watermelon?"). (b) RESQ, a human-generated dataset for probing the models on realistic SQA (e.g., "A grey car is parking in front of a grey house with brown window frames and plants on the balcony." with questions "Are the plants in front of the car? No" and "Are the plants in the house? Yes").
To generate SPARTUN, we follow the idea of the SPARTQA (Mirzaee et al., 2021) benchmark and generate scene graphs from a set of images. The edges in this graph yield a set of triplets, such as ABOVE(blue circle, red triangle), which are used to generate a scene description (i.e., a story).
In SPARTUN, we map the spatial relation types in triplets (e.g., ABOVE) to a variety of spatial language expressions (e.g., over, north, above) to enable transfer learning across various data domains. We also build a logical spatial reasoner to compute all possible direct and indirect spatial relations between graph nodes. The questions of this dataset are then selected from the indirect relations.
To evaluate the effectiveness of SPARTUN in transfer learning, we created another dataset named RESQ (Real-world Spatial Questions; Fig 1b). This dataset is built on the MSPRL (Kordjamshidi et al., 2017) corpus, to which we added human-generated spatial questions and answers for its real image descriptions. This dataset comparatively reflects more of the realistic challenges and complexities of the SQA problem. (Metaphoric usages and implicit meanings are not covered in this work. The full list of spatial expressions used in this dataset and the dataset generation code are provided at https://github.com/HLR/SpaRTUN.)
We analyze the impact of SPARTUN as a source of extra supervision on several SQA and SPRL benchmarks. To the best of our knowledge, we are the first to use synthetic supervision for the SPRL task. Our results show that the auto-generated data successfully improves the SOTA results on MSPRL and SPARTQA-HUMAN, which are annotated for the SPRL task. Moreover, further pretraining models with SPARTUN for the SQA task improves the results of previous models on the RESQ, StepGame, and SPARTQA-HUMAN benchmarks. Furthermore, studying the broad coverage of spatial relation expressions of SPARTUN in realistic domains demonstrates that this feature is a key factor for transfer learning.
The contributions of this paper can be summarized as: (1) We build a new synthetic dataset to serve as a source of supervision and transfer learning for spatial language understanding tasks with broad coverage of spatial relation types and expressions (which is easily extendable); (2) We provide a human-generated dataset to evaluate the performance of transfer learning on real-world spatial question answering; (3) We evaluate the transferability of the models pretrained with SPARTUN on multiple SQA and SPRL benchmarks and show significant improvements in SOTA results.

Related Research
Requiring large amounts of annotated data is a well-known issue in training complex deep neural models (Zhu et al., 2016), and it extends to spatial language processing tasks. In our study, we noticed that all available large datasets on the SQA task, including bAbI (Weston et al., 2015), SPARTQA-AUTO (Mirzaee et al., 2021), and StepGame (Shi et al., 2022), are synthetic. bAbI is a simple dataset that covers a limited set of relation types, spatial rules, and vocabulary. StepGame focuses on a few relation types, but with more relation expressions for each, and considers multiple reasoning steps. SPARTQA-AUTO, comparatively, contains more relation types and needs complex multi-hop spatial reasoning. However, it contains a single linguistic spatial expression for each relation type. All of these datasets are created in controlled toy settings and are not comparable with real-world spatial problems in terms of realistic language complexity and coverage of all possible relation types. SPARTQA-HUMAN (Mirzaee et al., 2021) is a human-generated version of SPARTQA-AUTO with more spatial expressions. However, this dataset is provided for probing purposes and has a training set too small to effectively train deep models.
For the SPRL task, MSPRL and SpaceEval (SemEval-2015 Task 8) (Pustejovsky et al., 2015) are the two available datasets with spatial role and relation annotations. These are small-scale datasets for studying the SPRL problem. Among previous works that tried transfer learning on the SPRL task, Moussa et al. (2021) only applied it to the word embeddings of their SPRL model, and Shin et al. (2020) used PLMs without any specifically designed dataset for further pretraining. These issues motivated us to create SPARTUN for further pretraining and transfer learning on SQA and SPRL.
Transfer learning has been used effectively in different NLP tasks to further fine-tune PLMs (Razeghi et al., 2022; Alrashdi and O'Keefe, 2020; Magge et al., 2018). Besides transfer learning, several other approaches are used to tackle the lack of training data in various NLP areas, such as techniques for labeling unlabeled data (Enayati et al., 2021), semi-supervised models (Van Krieken et al., 2019; Li et al., 2021), or data augmentation with synthetic data (Li et al., 2019; Min et al., 2020). However, transfer learning is a simple way of using synthetic data as an extra source of supervision at no annotation cost. Compared to augmentation methods, the data in transfer learning only needs to be close to the target task/domain (Ma et al., 2021), not necessarily the same. Mirzaee et al. (2021) is the first work that considers transfer learning for SQA. It shows that training models on synthetic data and fine-tuning with small human-generated data yields better performance from PLMs. However, their coverage of spatial relations and expressions is insufficient for effective transfer learning to realistic domains.
Logical reasoning has been used before to build QA datasets that require complex reasoning (Clark et al., 2020; Saeed et al., 2021). More recent efforts even use the path of reasoning and train models to follow it (Tafjord et al., 2021). However, no previous work models spatial reasoning with the broad coverage of spatial logic that we provide here.

Transfer Learning for Spatial Language Understanding
To evaluate transfer learning on spatial language understanding, we select two main tasks, spatial question answering (SQA) and spatial role labeling (SPRL). Given the popularity of PLMs in transfer learning (Khashabi et al., 2020; Ma et al., 2021; Clark et al., 2020), we design PLM-based models for this evaluation. In the rest of this section, we describe each task and model in detail.

Spatial Question Answering
In spatial question answering, given a scene description, the task is to answer questions about the spatial relations between entities (e.g., Figure 1).
Here, we focus on challenging questions that need multi-hop spatial reasoning over explicit relations. We consider two question types, YN (Yes/No) and FR (Find Relations). The answer to YN is chosen from "Yes" or "No," and the answer to FR is chosen from a set of relation types.
We use a PLM with classification layers as a baseline for the SQA task.We use a binary classification layer for each label for questions with more than one valid answer and a multi-class classification layer for questions with a single valid answer.
To predict the answer, we pass the concatenation of the question and story to the PLM (more detail in Devlin et al. (2019)). The final output of the [CLS] token is passed to the classification layer, and depending on the question type, the label or multiple labels with the highest probability are chosen as the final answer.
We train the models with the summation of the cross-entropy losses of all binary classifiers in multi-label classification, or the single cross-entropy loss of one classifier in multi-class classification. In the multi-label setting, we remove inconsistent answers by post-processing during the inference phase. For instance, LEFT and RIGHT relations cannot be valid answers simultaneously.
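As an illustration, the inference-time post-processing described above can be sketched as follows (the label set, threshold, and function names here are our own illustrative choices, not the paper's released code):

```python
# Hypothetical sketch of the inference-time post-processing: each label
# has an independent binary classifier, and mutually inconsistent
# relations cannot both survive as answers.
RELATIONS = ["left", "right", "above", "below", "near", "far"]
INCONSISTENT_PAIRS = [("left", "right"), ("above", "below"), ("near", "far")]

def select_answers(probs, threshold=0.5):
    """Keep every label whose probability clears the threshold, then
    drop the weaker member of any inconsistent pair."""
    chosen = {r: p for r, p in zip(RELATIONS, probs) if p > threshold}
    for a, b in INCONSISTENT_PAIRS:
        if a in chosen and b in chosen:
            del chosen[a if chosen[a] < chosen[b] else b]
    return sorted(chosen)
```

For instance, if both "left" and "right" clear the threshold, only the one with the higher probability is kept.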

Spatial Role Labeling
Spatial role labeling (Kordjamshidi et al., 2010, 2011) is the task of identifying and classifying the spatial roles (Trajector, Landmark, and spatial indicator) and their relations. A relation is selected from the relation types in Table 1 and assigned to each triplet of (Trajector, Spatial indicator, Landmark) extracted from the sentence. We call the former spatial role extraction and the latter spatial relation extraction (Figure 3).
Several neural models have been proposed for spatial role labeling (Mazalov et al., 2015; Ludwig et al., 2016; Datta and Roberts, 2020). We take a similar approach to prior research (Shin et al., 2020) for the extraction of spatial roles (entities (Trajector/Landmark) and spatial indicators).
First, we separately tokenize each sentence in the context and use a PLM (BERT here) to compute the token representations. Next, we apply a BIO tagging layer on the token representations using (O, B-entity, I-entity, B-indicator, I-indicator) tags. A Softmax layer on the BIO tagger output selects the spatial entities and spatial indicators with the highest probability. For training, we use cross-entropy loss given the spatial annotations.
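The role-extraction step above produces BIO tags that must be grouped back into role spans. A minimal sketch of that decoding (the helper and its tag handling are our own, following the tag set named above):

```python
# Group tokens into (span_text, role) pairs from BIO tags, where role
# is "entity" or "indicator" as in the tag set described in the text.
def decode_bio(tokens, tags):
    spans, current, role = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), role))
            current, role = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == role:
            current.append(tok)  # continue the open span
        else:
            if current:
                spans.append((" ".join(current), role))
            current, role = [], None
    if current:
        spans.append((" ".join(current), role))
    return spans
```

On a tagged sentence like "A grey car is parking in front of a grey house", this yields the two spatial entities and the spatial indicator "in front of".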
For the spatial relation extraction model, similar to (Yao et al., 2019; Shin et al., 2020), we use BERT and a classification layer to extract correct triplets. Given the output of the spatial role extraction model, for each combination of (spatial entity (tr), spatial indicator (sp), spatial entity (lm)) in each sentence, we create an input and pass it to the BERT model. To indicate the position of each spatial role in the sentence, we use segment embeddings, adding 1 for role positions and 0 otherwise.
The [CLS] output of BERT is passed to a one-layer MLP that provides the probability of the triplet being correct (see Figure 3).

Figure 3: The spatial role labeling model includes two separately trained modules. E: entity, SP: spatial indicator. As an example, the triplet (a grey house, front, a grey car) is correct and its spatial_type = FRONT, while (a grey car, front, a grey house) is incorrect and its spatial_type = NaN.

We also predict the spatial type of each triplet as an auxiliary task for spatial relation extraction. To this aim, we apply another multi-class classification layer on the same [CLS] token. To train the model, we use a joint loss function over both the relation and type modules (more detail in Appendix B).
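The segment-embedding marking described above can be illustrated with a small sketch (the function name and the half-open span format are our own assumptions, not the paper's code):

```python
# Illustrative sketch of marking role positions with 0/1 segment ids.
def mark_roles(tokens, role_spans):
    """Return one segment id per token: 1 where the token belongs to the
    trajector, spatial indicator, or landmark of the candidate triplet,
    0 otherwise."""
    segment_ids = [0] * len(tokens)
    for start, end in role_spans:  # half-open [start, end) token ranges
        for i in range(start, end):
            segment_ids[i] = 1
    return segment_ids
```

These ids would be added to the input embeddings alongside BERT's usual token and position embeddings so the model knows which tokens belong to the candidate triplet.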

SPARTUN: Dataset Construction
To provide a source of supervision for spatial language understanding tasks, we generate a synthetic dataset in SQA format that contains SPRL annotations of sentences. We build this dataset by expanding SPARTQA in multiple aspects. The following additional features are considered in creating SPARTUN: F1) Broad coverage of various types of spatial relations, including rules of reasoning over their combinations (e.g., NTPP(a, b), LEFT(b, c) → LEFT(a, c)) in various domains. F2) Broad coverage of spatial language expressions and utterances used in various domains. F3) Extra annotations, such as the supporting facts and the number of reasoning steps for SQA, to be used in complex modeling. In the rest of this section, we describe the details of creating SPARTUN and how we support the above-mentioned features. Figure 4 depicts the SPARTUN data construction flow.

Spatial Relation Computation. Following SPARTQA-AUTO, we use the NLVR scene graphs (Suhr et al., 2017) and compute relations between objects in each block based on their given coordinates. NLVR is limited to 2D relation types; therefore, to add more dimensions (FRONT and BEHIND), we randomly change LEFT and RIGHT to BEHIND and FRONT in a subset of examples. Moreover, there are no relations between blocks in NLVR descriptions.
To expand the types of relations, we remove this limitation and randomly assign relations to the blocks while ensuring the spatial constraints are not violated. Then, we create a new scene graph with the computed spatial relations. The nodes in this graph represent the entities (objects or blocks), and the directed edges are the spatial relations.
Question Selection. There are several paths between each pair of entities in the generated scene graph. We call a path valid if at least one relation can be inferred between its start and end nodes. For example, in Figure 4, NTPP(A, X), FRONT(X, Y), TPPI(Y, B) is valid since it results in FRONT(A, B), while NTPP(A, X), NTPPI(X, C) is not a valid path: there are no rules of reasoning that can be applied to infer new relations.
To verify the validity of each path, we pass its edges, represented as triplets in predicate-argument form, to a logical spatial reasoner (implemented in Prolog) and query all possible relations between the pair. The number of triplets in each path represents the number of reasoning steps for inferring the relation.
We generate the question triplets from the paths with the most reasoning steps (edges). The question asks about the spatial relationship between the head and tail entities of the selected path. The triplets in this path are used to generate the story and are annotated as supporting facts. Additionally, the story includes extra information (extra triplets) unnecessary for answering the question, to increase the complexity of the task.

Spatial Reasoner. We implement several rules (in the form of Horn clauses, shown in Table 2) in Prolog, which express the logic between the relation types (described in Table 1) in various formalisms and model the logical spatial reasoning computation (see Appendix B.1). Compared to previous tools (Wolter, 2009), we are the first to include the spatial logical computation between multiple formalisms. This reasoner validates the questions/queries based on the given facts. For instance, by using the Combination rule in Table 2 over the set of facts {NTPP(A, X), FRONT(X, Y), TPPI(Y, B)}, the reasoner returns True for the query FRONT(A, B) and False for FRONT(B, A) or BEHIND(A, B).
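The paper's reasoner is implemented in Prolog; as a hedged illustration only, a few of its rule families (inverses, transitivity of directional relations, and combination with containment) can be approximated by forward chaining to a fixpoint. The relation names follow Table 1, but this rule encoding is our own simplification, not the released reasoner:

```python
from itertools import product

# Simplified approximation of a few spatial Horn-clause rule families.
INVERSE = {"left": "right", "right": "left",
           "above": "below", "below": "above",
           "front": "behind", "behind": "front",
           "ntpp": "ntppi", "ntppi": "ntpp",
           "tpp": "tppi", "tppi": "tpp"}
DIRECTIONS = {"left", "right", "above", "below", "front", "behind"}
INSIDE = {"ntpp", "tpp"}      # first argument is inside the second
CONTAINS = {"ntppi", "tppi"}  # first argument contains the second

def infer(facts):
    """Forward-chain over (relation, a, b) facts until a fixpoint."""
    facts = set(facts)
    while True:
        new = set()
        for (r1, a, b), (r2, c, d) in product(facts, facts):
            if r1 in INVERSE:                      # LEFT(a,b) -> RIGHT(b,a)
                new.add((INVERSE[r1], b, a))
            if r1 in DIRECTIONS and r1 == r2 and b == c and a != d:
                new.add((r1, a, d))                # transitivity
            if r1 in INSIDE and r2 in DIRECTIONS and b == c and a != d:
                new.add((r2, a, d))                # NTPP(a,b), FRONT(b,c) -> FRONT(a,c)
            if r1 in DIRECTIONS and r2 in CONTAINS and b == c and a != d:
                new.add((r1, a, d))                # FRONT(a,b), TPPI(b,c) -> FRONT(a,c)
        if new <= facts:
            return facts
        facts |= new
```

On the worked example facts {NTPP(A, X), FRONT(X, Y), TPPI(Y, B)}, this sketch infers FRONT(A, B) but not FRONT(B, A), matching the reasoner behavior described above.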
Text Generation. The scene description is generated from the story triplets selected in the question selection phase, using a publicly available context-free grammar (CFG) provided with SPARTQA-AUTO. However, we increase the variety of spatial expressions by using a vocabulary of various entity properties and relation expressions (e.g., above, over, or north for the ABOVE relation type) taken from existing resources (Freeman, 1975; Mark et al., 1989; Lockwood et al., 2006; Stock et al., 2022; Herskovits, 1986). We map the relation types and the entity properties to the lexical forms in our collected vocabulary.
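The mapping from relation types to lexical forms can be pictured with a toy realization function (the lexicon entries echo examples from the paper's vocabulary; the template, function, and names are our own):

```python
import random

# Toy lexicon-driven surface realization for relation triplets.
LEXICON = {
    "ABOVE": ["above", "over", "north of"],
    "BELOW": ["below", "under", "south of"],
    "LEFT":  ["to the left of", "west of"],
}

def realize(triplet, rng=random):
    """Render a relation triplet as a sentence, sampling one of the
    spatial expressions mapped to its relation type."""
    rel, trajector, landmark = triplet
    expr = rng.choice(LEXICON[rel])
    return f"{trajector.capitalize()} is {expr} {landmark}."
```

Because the expression is sampled per sentence, the same ABOVE triplet can surface as "above", "over", or "north of", which is what gives the generated corpus its expression variety.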
For the question text, we generate the entity description and relation expression for each question triplet. The entity description is generated based on a subset of its properties in the story. For instance, an expression such as "a black object" can be generated to refer to both "a big black circle" and "a black rectangle". We generate two question types: YN (Yes/No) questions that ask whether a specific relation exists between two entities, and FR (Find Relations) questions that ask about all possible relations between them. To make YN questions more complex, we add quantifiers ("all" and "any") to the entities' descriptions.
Our text generation method can flexibly use an extended vocabulary to provide a richer corpus to supervise new target tasks when required.
Finding Answers. We search all entities in the story based on the entity descriptions (e.g., all circles, a black object) in each question and use the spatial reasoner to find the final answer.
SpRL Annotations. Along with generating the sentences for the story and questions, we automatically annotate the described spatial configurations with spatial roles and relations (trajector, landmark, spatial indicator, spatial type, triplet, entity ids). These annotations follow a previously proposed SPRL annotation scheme and provide free annotations for the SPRL task.
To generate SPARTUN, we use 6.6k NLVR scene graphs for training and 1k each for the dev and test sets. We collect 20k training, 3k dev, and 3k test examples for each of the FR and YN question types (see Table 3). On average, each story of SPARTUN contains eight sentences and 91 tokens and describes on average 10 relations between different mentions of entities. More details about the dataset statistics can be found in Appendix A.1.

Experimental Results
The focus of this paper is to provide a generic source of supervision for spatial language understanding tasks rather than proposing new techniques or architectures. Therefore, in the experiments, we analyze the impact of SPARTUN on SQA and SPRL using the PLM-based models described in Section 3.
In all experiments, we compare the performance of models fine-tuned on the target datasets with and without further pretraining on synthetic supervision (SynSup). All code is publicly available. The details of the experimental settings and hyperparameters for the datasets are provided in the Appendix.

Spatial Question Answering
Here, we evaluate the impact of SPARTUN and compare it with the supervision received from other existing synthetic datasets. Since the datasets we use contain different question types, we supervise the models with the same question type as the target task.
The baselines for all experiments include a majority baseline (MB), which predicts the most repeated label as the answer to all questions, and a pretrained language model, BERT here. We also report human accuracy in answering the questions for the human-generated datasets. For all experiments, we evaluate the models by measuring accuracy, the percentage of correct predictions on the test sets.

StepGame is a synthetic SQA dataset containing FR questions that need k reasoning steps to be answered (k = 1 to 10). The answer to each question is one relation from the set {left, right, below, above, lower-left, upper-right, lower-right, upper-left}.

SQA Evaluation Datasets
RESQ. We created this dataset to reflect the natural complexity of real-world spatial descriptions and questions. We asked three volunteers (English-speaking undergraduate students) to generate Yes/No questions for the MSPRL dataset, which contains complex human-generated sentences. The questions require at least one step of reasoning. The advantage of RESQ is that the human-generated spatial descriptions and their spatial annotations already exist in the original dataset. The statistics of this dataset are provided in Appendix A.2. One challenge of RESQ, which is not addressed here, is that the questions require spatial commonsense knowledge in addition to capturing the spatial semantics. For example, using commonsense knowledge with the sentence "a lamp hanging on the ceiling", we can infer that the lamp is above all the objects in the room. To compute human accuracy, we asked two volunteers to answer 100 questions from the test set of RESQ and computed their accuracy. (Note: since the relation types are not used in SPARTQA, the answer is selected from a fixed set of relation expressions.)

Transfer Learning in SQA
The following experiments demonstrate the impact of transfer learning on SQA benchmarks under different supervision. Due to the simplicity of the bAbI dataset, PLMs can solve this benchmark with 100% accuracy (Mirzaee et al., 2021). Hence, we run our experiment on only 1k and 500 training examples of task 17 and task 19, respectively. Table 4 demonstrates the impact of synthetic supervision on both bAbI tasks. The results with various synthetic data are fairly similar for these two tasks. However, pretraining the model with the simple version of SPARTUN, named SPARTUN-S, performs better than other synthetic datasets on task 17. This can be due to the fewer relation expressions in SPARTUN-S, which follows the same structure as task 17.
In the next experiment, we investigate the impact of SPARTUN on SPARTQA-HUMAN results. As expected, all the PLM-based models almost solve the questions with one step of reasoning (i.e., where the answer directly exists in the text). However, as the number of reasoning steps increases, the performance of the models decreases. Comparing the impact of different synthetic supervision, SPARTUN achieves the best results for k > 3. For questions with k <= 3, SPARTUN-S achieves competitive results compared to SPARTUN. Overall, the performance gap between SPARTUN-S, SPARTQA-AUTO, and SPARTUN shows that the broader coverage of relation expressions in SPARTUN is effective.

Comparing the results in
In the next experiment, we show the influence of SPARTUN on real-world examples, which contain more types of spatial relations and need more rules of reasoning to be solved. Table 7 shows the results of transfer learning on RESQ. These results show that the limited coverage of spatial relations and expressions in SPARTQA-AUTO impacts the performance of BERT negatively. However, further pretraining BERT on SPARTUN-S improves the results on RESQ. This can be due to the higher coverage of relation types in SPARTUN-S compared to SPARTQA-AUTO. Using SPARTUN for further pretraining BERT yields the best performance and improves the results by 5.5%, indicating its advantage for transferring knowledge to solve real-world spatial challenges.

Spatial Role Labeling
Here, we analyze the influence of extra synthetic supervision on the SPRL task when evaluated on human-generated datasets. Table 8 shows the number of sentences in each SPRL benchmark.
The pipeline model described in Section 3 contains two main parts, a model for spatial role extraction (SRol) and a model for spatial relation extraction (SRel), which we analyze separately.
We further pretrain the BERT module in these models and then fine-tune it on the target domain. We use the macro F1-score (the mean of F1 over all classes) to evaluate the performance of the SRol and SRel models.

MSPRL is a human-curated dataset provided for the SPRL task. This dataset contains spatial descriptions of real-world images and the corresponding SPRL annotations (see Appendix A.6).

SPRL Evaluation Datasets
SPARTQA-HUMAN does not contain SPRL annotations. Hence, we asked two expert volunteers to annotate the stories/questions of this dataset. Then another expert annotator checked the annotations and discarded the erroneous ones. As a result, half of this training data is annotated with SPRL tags.

Transfer learning in SPRL
Table 9 demonstrates the influence of synthetic supervision in spatial role extraction evaluated on MSPRL and SPARTQA-HUMAN.
We compare the results of the SRol model with the previous SOTA, R-Inf (Manzoor and Kordjamshidi, 2018), on the MSPRL dataset. R-Inf uses external multi-modal resources and global inference. All of the BERT-based SRol models outperform R-Inf, which shows the power of PLMs for this task. However, since the accuracy of SRol is already very high, using synthetic supervision shows no improvement compared to the model trained only on the MSPRL training set. In contrast, on SPARTQA-HUMAN, using synthetic supervision helps the model perform better. In particular, using SPARTUN increases the performance of the SRol model dramatically, by 15%.
In  supervision due to the limited relation expressions used in this data.
In conclusion, our experiments show the efficiency of SPARTUN in improving the performance of models on different benchmarks due to the flexible coverage of relation types and expressions.

Conclusion and Future Work
We created a new synthetic dataset as a source of supervision for transfer learning on the spatial question answering (SQA) and spatial role labeling (SPRL) tasks. We show that expanding the coverage of relation types, relation combinations, and spatial language expressions provides a more robust source of supervision for pretraining and transfer learning. As a result, this data improves the models' performance in many experimental scenarios on both tasks when tested on various evaluation benchmarks. This data also includes rules of spatial reasoning and the chain of logical reasoning for answering the questions, which can be used in future research.
Moreover, we provide a human-generated dataset for realistic SQA that can be used to evaluate models and methods for spatial language understanding in real-world problems. This data extends a previous SPRL benchmark with spatial semantic annotations. As a result, this dataset contains annotations for both the SPRL and SQA tasks.
In future work, we plan to investigate explicit spatial reasoning over text with neuro-symbolic models. Moreover, using our methodology to generate synthetic spatial corpora in other languages, or for other types of reasoning such as temporal reasoning, is an exciting direction for future research.

Limitations
Though we aim for broad coverage of relation types and expressions, we collected them from available resources and spatial lexicons, and the coverage is by no means complete. There can be relations and expressions that are not covered. In particular, the relation expressions are limited to verbs and prepositions. The performance and reasoning ability of our models is improved with transfer learning, but this is certainly far from the natural language understanding desiderata. Our models are based on large language models and need GPU resources to execute.

A Datasets
A.1 SPARTUN

As described in Section 4, to cover more spatial expressions and spatial relation types, we provide an extendable vocabulary of these spatial phenomena. The entire vocabulary of supported relation expressions and entity properties is provided in Figure 10.

Statistics: Each example in SPARTUN contains a story that describes the spatial relations between entities and some questions that ask about indirect relations between entities. On average, each story contains eight sentences and 91 tokens, which describe ten relations on average.
We follow SPARTQA for dataset split.The number of questions in each train, dev, and test sets is provided in Table 3. YN questions can have two answers "Yes," which is the answer to 54% of questions, and "No," which is the answer to 46% of questions.

A.2 RESQ
The RESQ dataset is generated over the context of the MSPRL dataset. For each group of sentences (describing an image), we asked three volunteers (English-speaking undergraduate students) to generate at least four Yes/No questions. On average, they spent 20 minutes generating questions for each group of sentences; in total, they spent 210 hours generating the whole dataset. After gathering the data, another undergraduate student checked the questions, removed the incorrect ones, and kept the rest. The training set is built on the training set of MSPRL; since MSPRL does not have a dev set, we split off 32% of the test data (equal to 20% of the training set) and keep it as the dev set. 50% of the questions in this data are answered "Yes" and 50% "No". The statistics of this dataset are given in Table 3.
To compute human accuracy, we ask two undergraduate students, one of those who created the questions and one new volunteer, to answer 100 questions from the test set of RESQ. In the end, a third student grades their answers.

A.4 SPARTQA

SPARTQA contains complex textual contexts (stories) and questions requiring complex multi-hop spatial reasoning (e.g., Fig 6). This dataset contains one large synthesized subset (SPARTQA-AUTO) and a small human-generated subset (SPARTQA-HUMAN).
One of the advantages of SPARTQA is the SPRL annotation of the whole data (contexts and questions) provided with the main dataset. In this work, we also recruited two expert annotators who spent 270 hours annotating 2k sentences in SPARTQA-HUMAN using the WebAnno framework. Then another expert annotator checked their annotations and discarded the wrong ones. Statistics of SPARTQA are given in Table 3.

A.5 StepGame
StepGame is another synthetic dataset described in this paper. A sample of this dataset is shown in Figure 7.
A.6 MSPRL

SPRL is the task of identifying and classifying the spatial arguments of the spatial expressions mentioned in a sentence (Kordjamshidi et al., 2010). MSPRL (Kordjamshidi et al., 2017) is a dataset provided for the SPRL task. The statistics of this dataset are given in Table 11. In a static environment, a SPRL annotation can have the following spatial semantic components (Zlatev, 2008): trajector (the main entity), landmark (the reference entity), and spatial_indicator (the spatial term describing the relationship between trajector and landmark). A dynamic environment can also have path, region, direction, and motion. To understand MSPRL better, see Figure 8. In this figure, the spatial value assigned to each spatial triplet is chosen from Table 1.

Figure 6: An example SPARTQA story and questions. STORY: We have three blocks, A, B and C. Block B is to the right of block C and it is below block A. Block A has two black medium squares. Medium black square number one is below medium black square number two and a medium blue square. It is touching the bottom edge of this block. The medium blue square is below medium black square number two. Block B contains one medium black square. Block C contains one medium blue square and one medium black square. The medium blue square is below the medium black square. QUESTIONS: FB: Which block(s) has a medium thing that is below a black square? A, B, C. FB: Which block(s) doesn't have any blue square that is to the left of a medium square? A, B. FR: What is the relation between the medium black square which is in block C and the medium square that is below a medium black square that is touching the bottom edge of a block? Left. CO: Which object is above a medium black square? the medium black square which is in block C or medium black square number two? medium black square number two. YN: Is there a square that is below medium square number two above all medium black squares that are touching the bottom edge of a block? Yes.

B Models and modules
We use the HuggingFace16 implementation of pretrained BERT-base, which has 768 hidden dimensions. The remaining experimental settings, such as the number of epochs, batch size, and learning rate, are provided in Table 13; these settings were chosen by trial and error on the dev set of the target task. The supported expressions for relation types are shown in Figure 10a.

B.1 Logic-based spatial reasoner
We consider the logic rules mentioned in Figure 2, written in the form of Horn clauses. We collect the different combinations of spatial relations mentioned in Table 1 and implement the logic-based spatial reasoner. Figure 9a shows an example of part of our code for the LEFT relation. In Figure 9b, on the left, some facts are given, and the query "ntppi(room, X)" asks for all objects that exist in the room. Below each query, all possible predictions are listed.

left: "to the left of", "on the left side of", "to the left-hand side of", "west of", "to the west of", "at 9:00 position relative to", "at 9:00 position regarding to", "at 9 o'clock position regarding to"
right: "to the right of", "on the right side of", "to the right-hand side of", "east of", "to the east of", "at 3:00 position relative to", "at 3:00 position regarding to", "at 3 o'clock position regarding to"
above: "above", "over", "north of", "to the north of", "at 12:00 position relative to", "at 12:00 position regarding to", "at 12 o'clock position regarding to"
below: "below", "under", "south of", "to the south of", "at 6:00 position relative to", "at 6:00 position regarding to", "at 6 o'clock position regarding to"
behind: "behind"
front: "in front of"
Distances:
far: "far from", "farther from", "away from"
near: "near to", "close to"
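As a rough illustration of how such a reasoner operates, the following Python sketch forward-chains two kinds of Horn clauses, inverse and transitivity, to a fixed point. The relation names and rule set here are simplified assumptions for illustration, not the paper's Prolog implementation:

```python
from itertools import product

# Simplified lexicon: each relation and its inverse (an assumption for this sketch).
INVERSE = {"left": "right", "right": "left", "above": "below",
           "below": "above", "front": "behind", "behind": "front"}
TRANSITIVE = {"left", "right", "above", "below", "front", "behind"}

def closure(facts):
    """Forward-chain two Horn-clause schemas to a fixed point:
        right(Y, X) :- left(X, Y).             (inverse)
        left(X, Z)  :- left(X, Y), left(Y, Z). (transitivity)
    `facts` is a set of (relation, x, y) triples."""
    facts = set(facts)
    while True:
        new = set()
        for rel, x, y in facts:
            if rel in INVERSE:
                new.add((INVERSE[rel], y, x))
        for (r1, x, y), (r2, y2, z) in product(facts, facts):
            if r1 == r2 and y == y2 and x != z and r1 in TRANSITIVE:
                new.add((r1, x, z))
        if new <= facts:        # nothing newly derived: fixed point reached
            return facts
        facts |= new

def query(facts, rel, x=None, y=None):
    """Return all derived triples matching a (possibly partial) pattern."""
    return {(r, a, b) for r, a, b in closure(facts)
            if r == rel and x in (None, a) and y in (None, b)}
```

For example, from left(apple, melon) and left(melon, box), the reasoner derives left(apple, box) and its inverse right(box, apple).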

Figure 4: The data construction flow of SPARTUN. First, we generate scene graphs from NLVR images. Then a spatial reasoner validates each path between each pair of entities in this graph. All facts (F) in the selected path and some extra facts (E) from the scene graph are selected as story triplets, and the start and end nodes of the path are selected as the question triplet. Finally, we pass all triplets to a text generation module and compute the final answer. We ignore paths of length one (e.g., A (ABOVE) C) and only keep questions that need multi-hop reasoning.
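The path-selection step of this flow can be sketched as follows. The function names and the simple BFS path enumeration are illustrative assumptions, not the actual SPARTUN generation code:

```python
from collections import deque

def multi_hop_paths(edges, min_len=2):
    """Enumerate simple paths with at least `min_len` edges in a scene graph.

    `edges` is a list of (head, relation, tail) triples. Paths of length one
    are skipped so every generated question requires multi-hop reasoning."""
    adj = {}
    for h, r, t in edges:
        adj.setdefault(h, []).append((r, t))
    paths = []
    for start in adj:
        queue = deque([(start, [])])
        while queue:
            node, path = queue.popleft()
            if len(path) >= min_len:
                paths.append(path)
            for r, t in adj.get(node, []):
                # Avoid revisiting nodes so paths stay simple (cycle-free).
                if all(t != step[2] and t != start for step in path):
                    queue.append((t, path + [(node, r, t)]))
    return paths

def to_sample(path, extra_facts=()):
    """Story triplets = path facts (F) plus extra facts (E); the question
    asks about the relation between the path's start and end entities."""
    story = list(path) + list(extra_facts)
    question = (path[0][0], "?", path[-1][2])
    return story, question
```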

A.3 bAbI
This dataset is automatically generated; each sample includes two sentences describing relationships between three objects and a Yes/No question asking about the existence of a relation between two objects (Fig. 5). Task 19 of bAbI focuses on multi-hop spatial reasoning question answering.

Figure 5: An example of bAbI. bAbI task 19 contains questions asking about the directed path from one room to another. More statistics of this dataset are given in Table 3.

Figure 7: StepGame. An example of a question that needs 10 steps of reasoning.
All models are trained on the training set, evaluated on the dev set, and the results are reported on the test set. We train each model until no improvement is observed on the dev set, then store and use the best model from the dev set. We use the AdamW (Loshchilov and Hutter, 2017) optimizer for all models and modules. For SQA tasks we use Focal Loss (Lin et al., 2017) with γ = 2. For spatial argument extraction, we use cross-entropy loss for BIO tagging, and for spatial relation extraction, we use the sum of the losses of the spatial relation classification and relation type classification parts:

Loss = CrossEntropyLoss(p′, y′) + BCELoss(p, y)

(a) Example of rule clauses implemented in Prolog. (b) Example of facts, a query, and the answer of the implemented model.
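For concreteness, the losses above can be sketched in NumPy as follows; this is an illustrative re-implementation, not the code used in the experiments:

```python
import numpy as np

def bce_loss(probs, targets):
    """Binary cross-entropy averaged over elements."""
    return float(np.mean(-(targets * np.log(probs)
                           + (1 - targets) * np.log(1 - probs))))

def focal_loss(probs, targets, gamma=2.0):
    """Focal loss FL(p_t) = -(1 - p_t)^gamma * log(p_t) (Lin et al., 2017),
    averaged over elements; p_t is the probability of the true class."""
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))

def relation_loss(type_probs, type_label, rel_probs, rel_targets):
    """Loss = CrossEntropyLoss(p', y') + BCELoss(p, y): cross-entropy over
    the relation-type distribution plus multi-label BCE over relations."""
    return float(-np.log(type_probs[type_label])) + bce_loss(rel_probs, rel_targets)
```

With γ = 0 the focal loss reduces to the plain cross-entropy; γ = 2 down-weights already well-classified examples.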

Figure 9: Logic-based spatial reasoner.

Figure 10: The supported relation expressions and entity properties in SPARTUN, which can easily be extended based on the target task.

Table 1: Spatial relation types and examples of spatial language expressions.

In this work, we build a new synthetic dataset on SQA, called SPARTUN1 (Fig. 1a), to provide a source of supervision with broader coverage of spatial relation types and expressions2 compared to prior research.

Figure 2: The comparative coverage of relation types for SQA datasets, based on Table 1.

Table 3: Size of the SQA benchmarks.
SPARTQA-HUMAN is a small human-generated dataset containing YN and FR questions that require multi-hop spatial reasoning. The answer to YN questions is in {Yes, No, DK}, where DK (Do not Know) is used when the answer cannot be inferred from the context. The answer to FR questions is in {left, right, above, below, near to, far from, touching, DK}14.

In Table 5, we find that even though the classification layers for SPARTQA-

Table 6: Results of models with and without extra synthetic supervision on StepGame.

Table 7: Results with and without extra supervision on ReSQ. Human accuracy is the performance of humans answering a subset of the test set.
AUTO and SPARTQA-HUMAN are the same, the model trained on SPARTUN has better transferability. It achieves 2.6% higher accuracy on FR and 9% higher accuracy on YN questions compared to SPARTQA-AUTO. However, YN remains the most challenging question type in SPARTQA-HUMAN, and none of the PLM-based models reaches even the simple majority baseline.

Table 6 demonstrates our experiments on StepGame. BERT, without any extra supervision, significantly outperforms the best reported model of Shi et al., TP-MANN, which is based on a neural memory network.

Table 8: Number of sentences in the SPRL benchmarks. To train on SPARTQA-AUTO, we use only its 3k training examples (23-25k sentences).
In Table 10, we show the results of the SRel model (comprising spatial relation extraction and spatial relation type classification) for spatial relation extraction, with and without extra supervision from synthetic data. As with the SRol model, extra supervision from SPARTUN achieves the best result when tested on SPARTQA-HUMAN. For MSPRL, we compare the SRel model with R-Inf on spatial relation extraction. As Table 10 demonstrates, we improve the SOTA by 2.6% in F1 using SPARTUN as synthetic supervision. Also, the model further pretrained on SPARTQA-AUTO obtains a lower result than the model with no extra supervision.

Table 12: Results of the BERT (SQA) model trained and tested on the two synthetic supervision datasets.

In addition, the results of the BERT model trained on SPARTUN-Simple and SPARTUN-Clock and tested on the same datasets are provided in Table 12. SPARTUN-Simple contains only one spatial expression for each relation type, and SPARTUN-Clock contains all relation expressions plus the clock expressions (Column 5 in Table 1).