DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions

Modern machine learning relies on datasets to develop and validate research ideas. Given the growth of publicly available data, finding the right dataset to use is increasingly difficult. Any research question imposes explicit and implicit constraints on how well a given dataset will enable researchers to answer it, such as dataset size, modality, and domain. We operationalize the task of recommending datasets given a short natural language description of a research idea, to help people find relevant datasets for their needs. Dataset recommendation poses unique challenges as an information retrieval problem: datasets are hard to directly index for search, and there are no corpora readily available for this task. To facilitate this task, we build the DataFinder Dataset, which consists of a larger automatically constructed training set (17.5K queries) and a smaller expert-annotated evaluation set (392 queries). Using this data, we compare various information retrieval algorithms on our test set and present a superior bi-encoder retriever for text-based dataset recommendation. This system, trained on the DataFinder Dataset, finds more relevant search results than existing third-party dataset search engines. To encourage progress on dataset recommendation, we release our dataset and models to the public.


Introduction
Innovation in modern machine learning (ML) depends on datasets. The revolution of neural network models in computer vision (Krizhevsky et al., 2012) was enabled by the ImageNet Large Scale Visual Recognition Challenge (Deng et al., 2009). Similarly, data-driven models for syntactic parsing saw rapid development after adopting the Penn Treebank (Marcus et al., 1993; Palmer and Xue, 2010).

Figure 1: Queries for dataset recommendation impose constraints on the type of dataset desired. Keyword queries make these constraints explicit, while full-sentence queries impose implicit constraints. Ground truth relevant datasets for this query are colored in blue.
With the growth of research in ML and artificial intelligence (AI), there are hundreds of datasets published every year (shown in Figure 2). Knowing which to use for a given research idea can be difficult (Paullada et al., 2021). To illustrate, consider a real query from a graduate student who says, "I want to use adversarial learning to perform domain adaptation for semantic segmentation of images." They have implicitly issued two requirements: they need a dataset for semantic segmentation of images, and they want datasets that include diverse visual domains. A researcher may intuitively select popular, generic semantic segmentation datasets like COCO (Lin et al., 2014) or ADE20K (Zhou et al., 2019), but these are insufficient to cover the query's requirement of supporting domain adaptation. How can we infer the intent of the researcher and make appropriate recommendations?
To study this problem, we introduce the task of "dataset recommendation": given a full-sentence description or keywords describing a research topic, recommend datasets that could support research on this topic (§2). A concrete example is shown in Figure 1. We frame this task as an information retrieval problem (Manning et al., 2005), where the search collection is a set of datasets, each represented textually with a dataset description (from www.paperswithcode.com), structured metadata, and published "citances": references from published papers that use the dataset (Nakov et al., 2004). This framework allows us to rigorously track recommender performance with standard ranking metrics such as mean reciprocal rank (Radev et al., 2002).

Figure 2: The number of public AI datasets has exploded in recent years. Here we show the number released from 1990 to 2022 according to Papers with Code.
To strengthen evaluation, we build a dataset, the DataFinder Dataset, to measure how well we can recommend datasets for a given description ( §3). As a proxy for real-world queries for our dataset recommendation engine, we construct queries from paper abstracts to simulate researchers' historical information needs. We then identify the datasets used in a given paper, either through manual annotations (for our small test set) or using heuristic matching (for our large training set). To our knowledge, this is the first corpus available for studying dataset recommendation, and we believe this can serve as a challenging testbed for researchers interested in representing and searching complex data.
We evaluate three existing ranking algorithms on our dataset and task formulation as a step towards solving this task: BM25 (Robertson and Zaragoza, 2009), nearest-neighbor retrieval, and dense retrieval with neural bi-encoders (Karpukhin et al., 2020). BM25 is a standard baseline for text search, nearest-neighbor retrieval lets us measure the degree to which this task requires generalization to new queries, and bi-encoders are among the most effective search models used today (Zhong et al., 2022). Compared with third-party keyword-centric dataset search engines, a bi-encoder model trained on DataFinder is far more effective at finding relevant datasets. We show that finetuning the bi-encoder on our training set is crucial for good performance. However, we observe that this model is as effective when trained and tested on keyphrase queries as on full-sentence queries, suggesting that there is room for improvement in automatically understanding full-sentence queries.

Dataset Recommendation Task
We establish a new task for automatically recommending relevant datasets given a description of a data-driven system. Given a query q and a set of datasets D, retrieve the most relevant subset R ⊂ D one could use to test the idea described in q. Figure 1 illustrates this with a real query written by a graduate student.
The query q can take two forms: either a keyword query (the predominant interface for dataset search today (Chapman et al., 2019)) or a full-sentence description. Textual descriptions offer a more flexible input to the recommendation system, with the ability to implicitly specify constraints based on what a researcher wants to study, without needing to carefully construct keywords a priori.
Evaluation Metrics Our task framing naturally leads to evaluation by information retrieval metrics that estimate search relevance. In our experiments, we use four common metrics included in the trec_eval package, a standard evaluation tool used in the IR community: • Precision@k (P@k): The proportion of relevant items among the top k retrieved datasets. If P@k is 1, then every retrieved dataset is relevant.
• Recall@k: The fraction of relevant items that are retrieved. If R@k is 1, then the search results are comprehensive.
• Mean Average Precision (MAP): Assuming we have m relevant datasets in total, and k_i is the rank of the i-th relevant dataset, MAP is calculated as (1/m) Σ_{i=1}^{m} P@k_i (Manning et al., 2005). High MAP indicates strong average search quality over all relevant datasets.
• Mean Reciprocal Rank (MRR): The average over queries of the inverse of the rank at which the first relevant item was retrieved. Assuming R_q is the rank of the first relevant item retrieved for query q, MRR is calculated as (1/|Q|) Σ_q 1/R_q. High MRR means a user sees at least some relevant datasets early in the search results.

Figure 3: Overview of our data collection pipeline, which builds training and test queries and relevant-dataset labels from S2ORC (Lo et al., 2020), Galactica (Taylor et al., 2022), and SciREX (Jain et al., 2020).
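The four metrics above can be sketched in a few lines of Python. This is an illustrative reimplementation of the standard definitions, not the trec_eval code itself:

```python
def precision_at_k(ranked, relevant, k):
    """P@k: fraction of the top-k retrieved datasets that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """R@k: fraction of all relevant datasets found in the top k."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """AP: mean of P@k_i over the ranks k_i of the relevant datasets."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def reciprocal_rank(ranked, relevant):
    """RR: inverse rank of the first relevant dataset (0 if none retrieved)."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0
```

For a single query ranked ["COCO", "WMT14", "ADE20K"] with relevant set {"COCO", "ADE20K"}, P@2 = 0.5 and AP = (1/1 + 2/3)/2 ≈ 0.83; MAP and MRR then average AP and RR over all test queries.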

The DataFinder Dataset
To support this task, we construct a dataset called the DataFinder Dataset, consisting of (q, R) pairs extracted from published English-language scientific proceedings, where each q is either a full-sentence description or a keyword query. We collect a large training set through an automated method (for scalability), and we collect a smaller test set using real users' annotations (for reliable and realistic model evaluation). In both cases, our data collection contains two primary steps: (1) collecting search queries q that a user would use to describe their dataset needs, and (2) identifying relevant datasets R that match each query. Our final training and test sets contain 17,495 and 392 queries, respectively. Figure 3 summarizes our data collection approach. We explain the details below and provide further discussion of the limitations of our dataset in the Limitations section. We will release our data under a permissive CC-BY License.

Collection of Datasets
In our task definition, we search over the collection of datasets listed on Papers With Code, a large public index of papers that includes metadata for over 7000 datasets and benchmarks. For most datasets, Papers With Code Datasets stores a short human-written dataset description, a list of different names used to refer to the dataset (known as "variants"), and structured metadata such as the year released, the number of papers reported as using the dataset, the tasks contained, and the modality of data. Many datasets also include the paper that introduced the dataset. We use the dataset description, structured metadata, and the introducing paper's title to textually represent each dataset, and we analyze this design decision in §5.4.

Training Set Construction
To ensure scalability for the training set, we rely on a large corpus of scientific papers, S2ORC (Lo et al., 2020). We extract nearly 20,000 abstracts from AI papers that use datasets. To overcome the high cost of manually annotating queries or relevant datasets, we instead simulate annotations with few-shot learning and rule-based methods.

Query Collection
We extract queries from paper abstracts because, intuitively, an abstract will contain the most salient characteristics of a research idea or contribution. As a result, it is an ideal source for comprehensively collecting potential implicit constraints, as shown in Figure 1. We simulate query collection with the 6.7B parameter version of Galactica (Taylor et al., 2022), a large scientific language model that supports few-shot learning. In our prompt, we give the model an abstract and ask it to first extract five keyphrases: the tasks mentioned in the paper, the task domain of the paper (e.g., biomedical or aerial), the modality of data required, the language of data or labels required, and the length of text required (sentence-level, paragraph-level, or none mentioned). We then ask Galactica to generate a full query containing any salient keyphrases. We perform few-shot learning using 3 examples in the prompt to guide the model. Our prompt is shown in Appendix A.
Relevant Datasets For our training set, relevant datasets are automatically labeled using the body text of a paper. We apply a rule-based procedure to identify the datasets used in a given paper (corresponding to an abstract whose query has been auto-labeled). For each paper, we tag all datasets that satisfy two conditions: the paper must cite the paper that introduces the dataset, and the paper must mention the dataset by name at least twice in sections containing "results", "experiment", "evaluation", "result", "training", or "testing", to avoid non-salient dataset mentions, such as those commonly occurring in "related work". This tagging procedure is restrictive and emphasizes precision (i.e., an identified dataset is indeed used in the paper) over recall (i.e., all the used datasets are identified). Nonetheless, using this procedure, we tag 17,495 papers from S2ORC with at least one dataset from our collection of datasets.
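The two tagging conditions can be sketched as follows. The record fields used here (body, citations, variants, paper_id) are illustrative placeholders, not the actual S2ORC or Papers With Code schema:

```python
def tag_datasets(paper, datasets, min_mentions=2):
    """Tag every dataset that (1) has its introducing paper cited by this
    paper and (2) is mentioned by name (under any variant) at least
    min_mentions times in the paper's body text."""
    body = paper["body"].lower()
    tagged = []
    for ds in datasets:
        cited = ds["paper_id"] in paper["citations"]
        mentions = sum(body.count(name.lower()) for name in ds["variants"])
        if cited and mentions >= min_mentions:
            tagged.append(ds["name"])
    return tagged
```

Requiring both a citation and repeated mentions is what gives the procedure its precision-over-recall character: a dataset named once in passing, or cited but never used, is not tagged.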
To estimate the quality of these tagged labels, we manually examined 200 tagged paper-dataset pairs. Each pair was labeled as correct if the paper authors would have realistically had to download the dataset in order to write the paper. 92.5% (185/200) of dataset tags were deemed correct.

Test Set Construction
To accurately approximate how humans might search for datasets, we employed AI researchers and practitioners to annotate our test set. As mentioned above, dataset collection requires both query collection and relevant dataset collection. We use SciREX (Jain et al., 2020), a human-annotated set of 438 full-text papers from major AI venues originally developed for research into full-text information extraction, as the basis of our test set. We choose this dataset because it naturally supports the data collection described below.
Query Collection We collect search queries by asking annotators to digest, extract, and rephrase key information in research paper abstracts.
Annotators. To ensure domain expertise, we recruited 27 students, faculty, and recent alumni of graduate programs in machine learning, computer vision, robotics, NLP, and statistics from major US universities. We recruited 23 annotators on a voluntary basis through word of mouth; for the rest, we offered 10 USD in compensation. We sent each annotator a Google Form that contained between 10 and 20 abstracts to annotate. The instructions provided for that form are shown in Appendix B.
Annotation structure. For each abstract, we asked annotators to extract metadata regarding the abstract's task, domain, modality, language of data required, and length of data required. These metadata serve as keyphrase queries. Then, based on these keyphrases, we also ask the annotator to write a sentence that best reflects the dataset need of the given paper/abstract, which becomes the full-sentence query. Qualitatively, we found that the keyphrases helped annotators better ground and concretize their queries, and the queries often contain (a subset of) these keyphrases.
Model assistance. To encourage more efficient labeling (Wang et al., 2021), we provided auto-suggestions for each field from GPT-3 (Brown et al., 2020) and Galactica 6.7B (Taylor et al., 2022) to help annotators. We note that annotators rarely applied these suggestions directly: annotators accepted the final full-sentence query generated by either large language model only 7% of the time.
Relevant Datasets For each paper, SciREX contains annotations for mentions of all "salient" datasets, defined as datasets that "take part in the results of the article" (Jain et al., 2020). We used these annotations as initial suggestions for the datasets used in each paper. The authors of this paper then skimmed all 438 papers in SciREX and noted the datasets used in each paper. 46 papers were omitted because they either used datasets not listed on Papers With Code or were purely theorybased papers with no relevant datasets, leaving a final set of 392 test examples.
We double-annotated 10 papers with the datasets used. The annotators labeled the exact same set of datasets for 8 out of 10 papers, with a Fleiss-Davies kappa of 0.667, suggesting substantial inter-annotator agreement for our "relevant dataset" annotations (Davies and Fleiss, 1982; Loper and Bird, 2002).

Dataset Analysis
Using this set of paper-dataset tags, what can we learn about how researchers use datasets?
Our final collected dataset contains 17,495 training queries and 392 test queries. Training examples associate each query with a single dataset far more often than test examples do. This is due to our rule-based tagging scheme, which emphasizes precise labels over recall. Meanwhile, the median query from our expert-annotated test set has 3 relevant datasets associated with it.

Figure 5: The distribution of the number of datasets tagged in each paper, in train and test sets.

We also observed interesting dataset usage patterns:
• Researchers tend to converge towards popular datasets. Analyzing dataset usage by community (defined by publication venue: ACL, EMNLP, NAACL, TACL, and COLING for NLP; CVPR, ICCV, and WACV for computer vision; IROS, ICRA, and IJRR for robotics; and NeurIPS, ICML, and ICLR for machine learning, including proceedings from associated workshops), we find that in all fields, among all papers that use some publicly available dataset, more than 50% of papers in our training set use at least one of the top-5 most popular datasets.
Most surprisingly, nearly half of the papers tagged in the robotics community use the KITTI dataset (Geiger et al., 2013).
• Researchers tend to rely on recent datasets.
Figure 4 shows the distribution of relative ages of datasets used (i.e., the number of years between when a dataset is published and when a paper uses it for experiments). We observe that the average dataset used by a paper was released 5 years before the paper's publication (with a median of 5.6 years), but we also see a significant long tail of older datasets. This means that while some papers use long-established datasets, most papers exclusively use recently published datasets.
These patterns hint that researchers might overlook less cited datasets that match their needs in favor of standard status-quo datasets. This motivates the need for nuanced dataset recommendation.

Experimental Setup on DataFinder
How do popular methods perform on our new task and new dataset? How does our new paradigm differ from existing commercial search engines? In this section, we describe a set of standard methods that we benchmark, and we consider which third-party search engines to use for comparison.

Task Framing
We formulate dataset recommendation as a ranking task. Given a query q and a search corpus of datasets D, rank the datasets d ∈ D based on a query-dataset similarity function sim(q, d) and return the top k datasets. We compare three ways of defining sim(q, d): term-based retrieval, nearest-neighbor retrieval, and neural retrieval.
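The ranking framing is method-agnostic: any similarity function can be plugged in. A minimal sketch, with a toy word-overlap similarity standing in for the retrievers described below:

```python
def rank_datasets(query, datasets, sim, k=5):
    """Rank datasets by a query-dataset similarity function sim(q, d)
    and return the top k."""
    return sorted(datasets, key=lambda d: sim(query, d), reverse=True)[:k]

def overlap_sim(query, dataset):
    """Toy similarity: number of dataset-description words that also
    appear in the query (illustrative only)."""
    words = set(query.lower().split())
    return sum(1 for w in dataset["description"].lower().split() if w in words)
```

Here the "description" field is a hypothetical stand-in for the textual dataset representation discussed in §3.1.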

Models to Benchmark
To retrieve datasets for a query, we find the nearest datasets to that query in a vector space. We represent each query and dataset in a vector space using three different approaches.

Term-Based Retrieval We evaluated a BM25 retriever for this task, since this is a standard baseline algorithm for information retrieval. We implement BM25 (Robertson and Walker, 1999) using Pyserini (Lin et al., 2021).

Nearest-Neighbor Retrieval To understand the extent to which this task requires generalization to new queries unseen at training time, we experiment with direct k-nearest-neighbor retrieval against the training set. For a new query, we identify the most similar queries in the training set and return the relevant datasets from those training examples. In other words, each dataset is represented by vectors corresponding to all training set queries attached to that dataset. In practice, we investigate two types of feature extractors: TF-IDF (Jones, 2004) and SciBERT (Beltagy et al., 2019).
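The nearest-neighbor baseline can be sketched with a simple bag-of-words cosine similarity standing in for the TF-IDF or SciBERT features used in our experiments:

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(text.lower().split())

def cosine(u, v):
    num = sum(c * v.get(t, 0) for t, c in u.items())
    den = (math.sqrt(sum(c * c for c in u.values()))
           * math.sqrt(sum(c * c for c in v.values())))
    return num / den if den else 0.0

def knn_recommend(query, train_pairs, k=3):
    """train_pairs: (training query text, list of relevant dataset names).
    Datasets attached to the k most similar training queries vote,
    weighted by the similarity of their query to the new query."""
    q = bow(query)
    neighbors = sorted(train_pairs, key=lambda p: cosine(q, bow(p[0])),
                       reverse=True)[:k]
    votes = Counter()
    for text, dataset_names in neighbors:
        for name in dataset_names:
            votes[name] += cosine(q, bow(text))
    return [name for name, _ in votes.most_common()]
```

Because this method can only return datasets already attached to some training query, its performance directly measures how much the test queries demand generalization beyond the training set.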

Neural Retrieval
We implement a bi-encoder retriever using the Tevatron package. In this framework, we encode each query and document into a shared vector space and estimate similarity via the inner product between query and document vectors. We represent each query q and document d with the BERT embedding (Devlin et al., 2019) of its [CLS] token:

h_q = cls(BERT(q)), h_d = cls(BERT(d)), sim(q, d) = h_q · h_d

where cls(·) denotes the operation of accessing the [CLS] token representation from the contextual encoding (Gao et al., 2021). For retrieval, we separately encode all queries and documents and retrieve using efficient similarity search. Following recent work (Karpukhin et al., 2020), we minimize a contrastive loss and select hard negatives during training.
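The contrastive training objective can be sketched in plain Python. The vectors below stand in for the [CLS] embeddings; the positive document competes against hard negatives through a softmax over inner-product scores:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(q, pos, negs):
    """InfoNCE-style loss for one training example: negative log of the
    softmax probability assigned to the positive document among the
    positive and the hard negatives."""
    scores = [dot(q, pos)] + [dot(q, n) for n in negs]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return -math.log(exps[0] / sum(exps))
```

The loss shrinks as the query moves closer to its relevant dataset than to the negatives, which is exactly what makes inner-product search meaningful at retrieval time.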

Comparison with Search Engines
Besides benchmarking existing methods, we also compare the methods enabled by our new dataset recommendation task against the standard paradigm for dataset search: using a conventional search engine with short queries. We measured the performance of third-party dataset search engines taking as input either keyword queries or full-sentence method descriptions.
We compare on our test set with two third-party systems: Google Dataset Search (Brickley et al., 2019) and Papers with Code search. Google Dataset Search supports a large dataset collection, so we limit results to those from Papers with Code to allow comparison with the ground truth.
Our test set annotators frequently entered multiple keyphrases for each keyphrase type (e.g., "question answering, recognizing textual entailment" for the Task field). We constructed multiple queries by taking the Cartesian product of the sets of keyphrases from each field, deduplicating tokens that occurred multiple times in each query. After running each query against a commercial search engine, results were combined using balanced interleaving (Joachims, 2002).
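A minimal sketch of merging two ranked result lists. This is a simplified round-robin variant with duplicate skipping, not the full balanced interleaving procedure of Joachims (2002), which also randomizes which list leads:

```python
def interleave(results_a, results_b, k=5):
    """Round-robin merge of two ranked result lists, skipping items
    already taken, until k results are collected."""
    merged, seen = [], set()
    for a, b in zip(results_a, results_b):
        for item in (a, b):
            if item not in seen:
                seen.add(item)
                merged.append(item)
            if len(merged) == k:
                return merged
    return merged
```

For example, interleaving ["x", "y", "z"] with ["y", "w", "x"] alternates between the two lists while dropping the duplicated "y" and "x".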

Time Filtering
The queries in our test set were made from papers published between 2012 and 2020 (we could not include more recent papers in our query construction process because SciREX was released in 2020), with a median year of 2017. In contrast, half the datasets in our search corpus were introduced in 2018 or later.
To account for this discrepancy, for each query q, we only rank the subset of datasets D ′ = {d ∈ D | year(d) ≤ year(q)} that were introduced in the same year or earlier than the query.
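The filtering step above amounts to a one-line predicate over release years; the field name "year" is an illustrative stand-in for the release-year metadata:

```python
def filter_by_year(query_year, datasets):
    """Keep only datasets released no later than the query's source
    paper, so e.g. a 2017 query is never judged against 2019 datasets."""
    return [d for d in datasets if d["year"] <= query_year]
```

Without this filter, a system could be penalized for failing to retrieve datasets that did not yet exist when the query's paper was written.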

Benchmarking and Comparisons
Benchmarking shows that DataFinder benefits from deep semantic matching. In Table 1, we report retrieval metrics for the methods described in §4. To determine the standard deviation of each metric, we use bootstrap resampling (Koehn, 2004) over all test set queries. Term-based retrieval (BM25) performs poorly in this setting, while the neural bi-encoder model excels. This suggests our task requires capturing semantic similarity beyond what term matching can provide. Term-based kNN search is also not effective, implying that generalization to new queries is necessary for this task.

Table 1: Retrieval performance (P@5, R@5, MAP, and MRR) of each benchmarked model.
Commercial Search Engines are not effective on DataFinder. In Table 2, we compare our proposed retrieval system against third-party dataset search engines. For each search engine, we choose the top 5 results before computing metrics. We find these third-party search engines do not effectively support full-sentence queries. We speculate these search engines are adapted from term-based web search engines. In contrast, our neural retriever gives much better search results using both keyword and full-sentence queries.

Qualitative Analysis
Examples in Figure 7 highlight the tradeoffs between third-party search engines and models trained on DataFinder. In the first two examples, we see keyword-based search engines struggle when dealing with terms that could apply to many datasets, such as "semantic segmentation" or "link prediction". These keywords offer only a limited specification of the relevant dataset, but a system trained on simulated search queries from real papers can learn implicit filters expressed in a query.
On the final example, our system incorrectly focuses on the deep architecture described ("deep neural network architecture [...] using depthwise separable convolutions") rather than the task described by the user ("machine translation"). Improving query understanding for long queries is a key opportunity for improvement on this dataset.

More In-depth Exploration
We perform in-depth qualitative analyses to understand the trade-offs of different query formats and dataset representations.
Comparing full-sentence vs. keyword queries As mentioned above, we compare two versions of the DataFinder-based system: one trained and tested with description queries and the other with keyword queries. We observe that keyword queries offer similar performance to full-sentence descriptions for dataset search. This suggests more work should be done on making better use of the implicit requirements in full-sentence descriptions for natural language dataset search.
Key factors for successful queries What information in queries is most important for effective dataset retrieval? Using human-annotated keyphrase queries in our test set, we experiment with concealing particular information from the keyphrase query.
In Figure 8, we see task information is critical for dataset search; removing task keywords from queries reduces MAP from 23.5 to 7.5 (statistically significant with p < 0.001 by a paired bootstrap t-test). Removing constraints on the language of text data also causes a significant drop in MAP (p < 0.0001). Removing keywords for text length causes an insignificant reduction in MAP (p = 0.15), though it causes a statistically significant reduction on other metrics not shown in Figure 8: P@5 and R@5. Based on inspection of our test set, we speculate that domain keywords are unnecessary because the domain is typically implied by task keywords.

Figure 8: Comparison of the reduction in the MAP metric of the retrieval results after removing different types of query terms (e.g., keywords related to the task or language the researcher is interested in studying).

Table 3: Adding structured metadata to each dataset's textual representation significantly improves keyphrase search quality using a neural bi-encoder. We compute standard deviations via bootstrap resampling. We use the "Description + Struct. Info" textual representation for all other experiments in this paper.

Comparing textual representations of datasets
We represent datasets textually with a community-generated dataset description from Papers With Code, along with the title of the paper that introduced the dataset. We experiment with enriching this dataset representation in two ways. We first add structured metadata about each dataset (e.g., tasks, modality, number of papers that use the dataset on Papers With Code). We then cumulatively experiment with adding citances (sentences from other papers around a citation) to capture how others use the dataset. In Table 3, our neural bi-encoder achieves similar retrieval performance on all 3 representations for full-sentence search. Keyword search is more sensitive to dataset representation: adding structured information to the dataset representation provides significant benefits for keyword search. This suggests keyword search requires more specific dataset metadata than full-sentence search does to be effective.

Table 4: Finetuning for the dataset recommendation task significantly outperforms strong retrieval architectures finetuned for general search, like COCO-DR.
The value of finetuning Our bi-encoder retriever is finetuned on our training set. Given the effort required to construct a training set for tasks like dataset recommendation, is this step necessary?
In Table 4, we see that an off-the-shelf SciBERT encoder is ineffective. We observe that our queries, which are abstract descriptions of the user's information need (Ravfogel et al., 2023), are very far from any documents in the embedding space, making comparison difficult. Using a state-of-the-art encoder, COCO-DR Base, which is trained for general-purpose passage retrieval on MS MARCO (Campos et al., 2016), helps with this issue but still cannot make up for task-specific finetuning.

Related Work
Most work on scientific dataset recommendation uses traditional search methods, including term-based keyword search and tag search (Lu et al., 2012; Kunze and Auer, 2013; Sansone et al., 2017; Chapman et al., 2019; Brickley et al., 2019; Lhoest et al., 2021). In 2019, Google Research launched Dataset Search (Brickley et al., 2019), offering access to over 2 million public datasets. Our work considers a subset of their search corpus: those datasets that have been posted on Papers with Code. Some work has considered other forms of dataset recommendation. Ben Ellefi et al. (2016) presented a system for dataset recommendation where the query is a "source dataset". More recently, Altaf et al. (2019) reported a system where the user's query is a set of research papers. Ours is the first work to study full-sentence queries for dataset search, in contrast to conventional dataset search, where queries are usually 3 or fewer tokens in length. The DataFinder Dataset is also, to our knowledge, the first dataset that supports data-driven research on dataset recommendation.

Conclusion
We introduce a new task for dataset recommendation from natural language queries. Our dataset supports search by either full-sentence or keyword queries, but we find that neural search algorithms trained for traditional keyword search are competitive with the same architectures trained for our proposed full-sentence search. An exciting future direction will be to make better use of natural language queries. We release our datasets along with our ranking systems to the public. We hope to spur the community to work on this task or on other tasks that can leverage the summaries, keyphrases, and relevance judgment annotations in our dataset.

Limitations
The primary limitations concern the dataset we created, which serves as the foundation of our findings. Our dataset suffers from four key limitations:

Reliance on Papers With Code Our system is trained and evaluated to retrieve datasets from Papers With Code Datasets (PwC). Unfortunately, PwC is not exhaustive. Several queries in our test set corresponded to datasets that are not in PwC, such as IWSLT 2014 (Birch et al., 2014), PASCAL VOC 2010 (Everingham et al., 2010), and CHiME-4 (Vincent et al., 2017). Papers With Code Datasets also skews the publication year of papers used in the DataFinder Dataset towards the present (the median years of papers in our train and test sets are 2018 and 2017, respectively). For the most part, PwC only includes datasets used by another paper listed in Papers With Code, leading to the systematic omission of datasets seldom used today.
Popular dataset bias in the test set Our test set is derived from the SciREX corpus (Jain et al., 2020). This corpus is biased towards popular or influential works: the median number of citations of a paper in SciREX is 129, compared to 19 for any computer science paper in S2ORC. The queries in our test set are therefore more likely to describe mainstream ideas in popular subfields of AI.
Automatic tagging Our training data is generated automatically using a list of canonical dataset names from Papers With Code. This tagger mislabels papers where a dataset is used but never referred to by one of these canonical names (e.g. nonstandard abbreviations or capitalizations). Therefore, our training data is noisy and imperfect.
Queries in English only All queries in our training and test datasets were in English. Therefore, these datasets only support the development of dataset recommendation systems for English-language users. This is a serious limitation, as AI research is increasingly done in languages other than English, such as Chinese (Chou, 2022).

Ethics Statement
Our work has the promise of improving the scientific method in artificial intelligence research, with the particular potential of being useful for younger researchers or students. We built our dataset and search systems with the intention that others could deploy and iterate on our dataset recommendation framework. However, we note that our initial dataset recommendation systems have the potential to increase inequities in two ways.
First, as mentioned in Limitations, our dataset does not support queries in languages other than English, which may exacerbate inequities in dataset access. We hope future researchers will consider the construction of multilingual dataset search queries as an area for future work.
Second, further study is required to understand how dataset recommendation systems affect the tasks, domains, and datasets that researchers choose to work on. Machine learning models are liable to amplify biases in training data (Hall et al., 2022), and inequities in which domains or tasks receive research attention could have societal consequences. We ask researchers to consider these implications when conducting work on our dataset.

A Few-Shot Prompt for Generating Keyphrases and Queries
When constructing our training set, we perform in-context few-shot learning with the 6.7B parameter version of Galactica (Taylor et al., 2022), using the following prompt: Given an abstract from an artificial intelligence paper: 1) Extract keyphrases regarding the task (e.g. image classification), data modality (e.g. images or speech), domain (e.g. biomedical or aerial), training style (unsupervised, semi-supervised, fully supervised, or reinforcement learning), text length (sentence-level or paragraph-level), and language required (e.g. English). 2) Write a brief, single-sentence summary containing these relevant keyphrases. This summary must describe the task studied in the paper.

Abstract:
We study automatic question generation for sentences from text passages in reading comprehension. We introduce an attention-based sequence learning model for the task and investigate the effect of encoding sentence- vs. paragraph-level information. In contrast to all previous work, our model does not rely on hand-crafted rules or a sophisticated NLP pipeline; it is instead trainable end-to-end via sequence-to-sequence learning. Automatic evaluation results show that our system significantly outperforms the state-of-the-art rule-based system. In human evaluations, questions generated by our system are also rated as being more natural (i.e., grammaticality, fluency) and as more difficult to answer (in terms of syntactic and lexical divergence from the original text and reasoning needed to answer).

Abstract:
We present a self-supervised approach to estimate flow in camera image and top-view grid map sequences using fully convolutional neural networks in the domain of automated driving. We extend existing approaches for self-supervised optical flow estimation by adding a regularizer expressing motion consistency assuming a static environment. However, as this assumption is violated for other moving traffic participants we also estimate a mask to scale this regularization. Adding a regularization towards motion consistency improves convergence and flow estimation accuracy. Furthermore, we scale the errors due to spatial flow inconsistency by a mask that we derive from the motion mask. This improves accuracy in regions where the flow drastically changes due to a better separation between static and dynamic environment. We apply our approach to optical flow estimation from camera image sequences, validate on odometry estimation and suggest a method to iteratively increase optical flow estimation accuracy using the generated motion masks. Finally, we provide quantitative and qualitative results based on the KITTI odometry and tracking benchmark for scene flow estimation based on grid map sequences. We show that we can improve accuracy and convergence when applying motion and spatial consistency regularization.

Abstract:
In this paper, we study the actor-action semantic segmentation problem, which requires joint labeling of both actor and action categories in video frames. One major challenge for this task is that when an actor performs an action, different body parts of the actor provide different types of cues for the action category and may receive inconsistent action labeling when they are labeled independently. To address this issue, we propose an end-to-end region-based actor-action segmentation approach which relies on region masks from an instance segmentation algorithm. Our main novelty is to avoid labeling pixels in a region mask independently; instead we assign a single action label to these pixels to achieve consistent action labeling. When a pixel belongs to multiple region masks, max pooling is applied to resolve labeling conflicts. Our approach uses a two-stream network as the front-end (which learns features capturing both appearance and motion information), and uses two region-based segmentation networks as the back-end (which takes the fused features from the two-stream network as the input and predicts actor-action labeling). Experiments on the A2D dataset demonstrate that both the region-based segmentation strategy and the fused features from the two-stream network contribute to the performance improvements. The proposed approach outperforms the state-of-the-art results by more than

For a given abstract that we want to process, we append its text to this prompt and ask the language model to generate at most 250 new tokens.
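The query-generation step above can be sketched as follows. This is a minimal illustration, not our exact pipeline: the prompt text is abbreviated from this appendix, and the Hugging Face checkpoint name and decoding settings are assumptions.

```python
# Sketch of the training-set construction step: append a paper abstract to the
# few-shot prompt and ask the language model to complete it with keyphrases
# and a single-sentence query. FEW_SHOT_PROMPT abbreviates the full prompt
# (including worked examples) shown in this appendix.

FEW_SHOT_PROMPT = (
    "Given an abstract from an artificial intelligence paper: "
    "1) Extract keyphrases regarding the task, data modality, domain, "
    "training style, text length, and language required. "
    "2) Write a brief, single-sentence summary containing these relevant "
    "keyphrases. This summary must describe the task studied in the paper.\n"
)

def build_prompt(abstract: str) -> str:
    """Append a new abstract to the few-shot prompt for the LM to complete."""
    return f"{FEW_SHOT_PROMPT}\nAbstract:\n{abstract.strip()}\n"

# Generation with Galactica via Hugging Face transformers (checkpoint name is
# an assumption), capped at 250 new tokens as described above:
#
#   from transformers import AutoTokenizer, OPTForCausalLM
#   tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
#   model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b")
#   ids = tokenizer(build_prompt(abstract), return_tensors="pt").input_ids
#   out = model.generate(ids, max_new_tokens=250)
#   query = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
```

In practice, the model's completion (keyphrases plus a one-sentence summary) is then parsed into the training-set query.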

B Information on Expert Annotations
As mentioned in §3, we recruited 27 graduate students, faculty, and recent graduate program alumni for our annotation collection process. Each annotator confirmed, verbally or in writing, their interest in participating in our data collection.
We then sent them a Google Form containing between 10 and 20 abstracts to annotate. An example of the form instructions is included in Figure 9.
We originally had annotators label the "Training Style" (unsupervised, semi-supervised, supervised, or reinforcement learning) in addition to Task, Modality, Domain, Text Length, and Language Required. However, labels for this field were excessively noisy, so we exclude it from our experiments.