Vasile Rus


2018

This paper investigates what differentiates effective tutorial sessions from less effective ones. Towards this end, we characterize and explore human tutors’ actions in tutorial dialogue sessions by mapping tutor-tutee interactions, which are streams of dialogue utterances, onto streams of actions, based on the language-as-action theory. Next, we use human expert judgment measures, evidence of learning (EL) and evidence of soundness (ES), to identify effective and ineffective sessions. We perform sub-sequence pattern mining to identify sub-sequences of dialogue modes that discriminate effective sessions from ineffective ones. We finally use the results of the sub-sequence analysis to generate a Markov process model of effective tutorial sessions.
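As a rough illustration of the final step, the sketch below estimates a first-order Markov process over dialogue modes from mode sequences of effective sessions. This is a minimal sketch, not the paper's implementation; the mode names and session data are hypothetical placeholders.

```python
# Minimal sketch: estimate P(next_mode | current_mode) from mode sequences
# of effective sessions. Mode names and data are hypothetical.
from collections import Counter, defaultdict

def transition_probabilities(sessions):
    """Count mode-to-mode transitions and normalize into probabilities."""
    counts = defaultdict(Counter)
    for modes in sessions:
        for current, nxt in zip(modes, modes[1:]):
            counts[current][nxt] += 1
    return {
        mode: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for mode, nexts in counts.items()
    }

effective_sessions = [
    ["Scaffolding", "Modeling", "Scaffolding", "Fading"],
    ["Modeling", "Scaffolding", "Fading"],
]
print(transition_probabilities(effective_sessions))
```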

2017

We describe our system (DT Team) submitted to the SemEval-2017 Task 1 Semantic Textual Similarity (STS) challenge for English (Track 5). We developed three different models with various features, including similarity scores calculated using word and chunk alignments, word/sentence embeddings, and a Gaussian Mixture Model (GMM). The correlation between our system’s output and the human judgments was up to 0.8536, which is more than 10% above the baseline and almost as good as the best performing system, which was at 0.8547 (a difference of just about 0.1%). Also, our system produced leading results when evaluated on a separate STS benchmark dataset. The word alignment and sentence embedding based features were found to be very effective.
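To make one of these feature types concrete, the sketch below computes a sentence-embedding similarity feature as the cosine between averaged word vectors. This is a minimal illustration under assumptions, not the submitted system; the `embeddings` lookup and toy vectors are hypothetical.

```python
# Sketch of a sentence-embedding similarity feature: average the word
# vectors of each sentence, then take the cosine of the two averages.
import numpy as np

def sentence_vector(tokens, embeddings, dim=2):
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def embedding_similarity(sent1, sent2, embeddings):
    v1 = sentence_vector(sent1.lower().split(), embeddings)
    v2 = sentence_vector(sent2.lower().split(), embeddings)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(np.dot(v1, v2) / denom) if denom else 0.0

# Hypothetical toy embeddings for demonstration only.
toy = {"dog": np.array([1.0, 0.0]), "cat": np.array([0.9, 0.1])}
print(embedding_similarity("the dog runs", "a cat runs", toy))
```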

2016

Identifying dialogue acts and dialogue modes during tutorial interactions is a crucial sub-step in understanding patterns of effective tutor-tutee interaction. In this work, we develop a novel joint inference method that labels each utterance in a tutoring dialogue session with a dialogue act and a specific mode from sets of pre-defined dialogue acts and modes, respectively. Specifically, we develop our joint model using Markov Logic Networks (MLNs), a framework that combines first-order logic with probabilities and is thus capable of representing complex, uncertain knowledge. We define first-order formulas in our MLN that encode the inter-dependencies between dialogue modes and the more fine-grained dialogue acts. We then use joint inference to label both the mode and the dialogue act of each utterance. We compare our system against a pipeline system based on SVMs on a real-world dataset with tutoring sessions of over 500 students. Our results show that the joint inference system is far more effective than the pipeline system in mode detection, improving over the pipeline system’s performance by about 6 points in F1 score. The joint inference system also performs much better than the pipeline system at labeling modes that highlight important pedagogical steps in tutoring.
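The toy sketch below illustrates the intuition behind joint inference, not an actual MLN: it scores joint (dialogue act, mode) assignments with weighted constraints that couple the two labels, then picks the highest-scoring assignment. The acts, modes, weights, and scoring function are all hypothetical stand-ins.

```python
# Toy sketch of joint labeling: weighted act-mode couplings play a role
# analogous to MLN formula weights; local evidence stands in for
# per-label classifier scores. All names and weights are hypothetical.
import itertools
import math

ACTS = ["Question", "Answer", "Hint"]
MODES = ["Scaffolding", "Modeling"]

# Weighted constraints coupling acts and modes.
WEIGHTS = {("Hint", "Scaffolding"): 1.5, ("Question", "Modeling"): 0.8}

def local_score(utterance, act, mode):
    # Stand-in for per-label evidence (e.g., classifier log-probabilities).
    return 1.0 if act.lower() in utterance.lower() else 0.0

def joint_label(utterance):
    best, best_score = None, -math.inf
    for act, mode in itertools.product(ACTS, MODES):
        score = local_score(utterance, act, mode) + WEIGHTS.get((act, mode), 0.0)
        if score > best_score:
            best, best_score = (act, mode), score
    return best

print(joint_label("Here is a hint about Newton's second law."))
```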
This paper introduces a rule-based method and software tool, called SemAligner, for aligning chunks across the texts in a given pair of short English texts. The tool, based on the top performing method at the Interpretable Short Text Similarity shared task at SemEval 2015, where it was used with human annotated (gold) chunks, can now additionally process plain text pairs using two chunkers we developed, e.g. one using Conditional Random Fields. Besides aligning chunks, the tool automatically assigns semantic relations to the aligned chunks (such as EQUI for equivalent and OPPO for opposite) and semantic similarity scores that measure the strength of the semantic relation between the aligned chunks. Experiments show that SemAligner performs competitively on system generated chunks and that these results are comparable to those obtained on gold chunks. SemAligner has other capabilities, such as handling various input formats and chunkers, as well as allowing its lookup resources to be extended.
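For illustration only, the sketch below greedily aligns chunks by token overlap and attaches a relation label and similarity score. SemAligner itself is rule-based and uses lexical resources; this simplified overlap heuristic, the threshold, and the 0-5 score scaling are assumptions.

```python
# Toy chunk aligner: pair each chunk with its best-overlapping counterpart,
# then assign a relation label and a 0-5 similarity score. The heuristic
# is a stand-in for SemAligner's actual rules.
def overlap(chunk_a, chunk_b):
    a, b = set(chunk_a.lower().split()), set(chunk_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def align_chunks(chunks1, chunks2, threshold=0.3):
    alignments, used = [], set()
    for c1 in chunks1:
        score, c2 = max(
            ((overlap(c1, c2), c2) for c2 in chunks2 if c2 not in used),
            default=(0.0, None),
        )
        if c2 is not None and score >= threshold:
            used.add(c2)
            relation = "EQUI" if score == 1.0 else "SIMI"
            alignments.append((c1, c2, relation, round(score * 5, 1)))
    return alignments

print(align_chunks(["the old man", "smiled"], ["the elderly man", "grinned"]))
```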
Negation occurs more frequently in dialogue than in commonly written texts, such as literary texts. Furthermore, the scope and focus of negation depend more on context in dialogue than in other forms of text. Existing negation datasets have focused on non-dialogue texts, such as literary texts, where the scope and focus of a negation are normally present within the same sentence as the negation itself, and are therefore not the most appropriate to inform the development of negation handling algorithms for dialogue-based systems. In this paper, we present the DT-Neg corpus (DeepTutor Negation corpus), which contains texts extracted from tutorial dialogues in which students interacted with an Intelligent Tutoring System (ITS) to solve conceptual physics problems. The DT-Neg corpus contains annotated negations in student responses, with scope and focus marked based on the context of the dialogue. Our dataset contains 1,088 instances and is available for research purposes at http://language.memphis.edu/dt-neg.
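To give a feel for what such an annotation involves, the record below is a hypothetical illustration of a DT-Neg-style instance; the field names and values are assumptions, not the corpus's actual schema.

```python
# Hypothetical DT-Neg-style record: a student response in its dialogue
# context, with the negation cue, scope, and focus marked. Schema and
# values are illustrative assumptions only.
instance = {
    "context": "Tutor: Does the net force on the ball point upward?",
    "response": "No, the net force does not point upward.",
    "cue": "not",
    "scope": "the net force does point upward",
    "focus": "upward",
}
```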

2014

In this paper, we analyze a number of data sets proposed over the last decade or so for the task of paraphrase identification. The goal of the analysis is to identify the advantages as well as the shortcomings of the previously proposed data sets. Based on the analysis, we make recommendations for improving the creation and use of such data sets when evaluating future approaches to paraphrase identification or the more general task of semantic similarity. The recommendations are meant to improve our understanding of what a paraphrase is, offer fairer ground for comparing approaches, increase the diversity of actual linguistic phenomena that future data sets will cover, and offer ways to better understand the contributions of the various modules or approaches proposed for solving paraphrase identification or similar tasks.
We describe the DARE corpus, an annotated data set focusing on pronoun resolution in tutorial dialogue. Although data sets for general purpose anaphora resolution exist, they are not suitable for dialogue-based Intelligent Tutoring Systems; to the best of our knowledge, no data set is currently available for pronoun resolution in such systems. The DARE corpus consists of 1,000 annotated pronoun instances collected from conversations between high-school students and the intelligent tutoring system DeepTutor. The data set is publicly available.
This paper introduces a collection of freely available Latent Semantic Analysis (LSA) models built on the entire English Wikipedia and the TASA corpus. The models differ not only in their source, Wikipedia versus TASA, but also in the linguistic items they focus on: all words, content words, nouns and verbs, and main concepts. Generating such models from large datasets (e.g. Wikipedia), which can provide broad coverage of the vocabulary in actual use, is computationally challenging, which is why large LSA models are rarely available. Our experiments show that, for the task of word-to-word similarity, the scores assigned by these models are strongly correlated with human judgment, outperforming many other frequently used measures and comparable to the state of the art.
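As a minimal sketch of the evaluation task, word-to-word similarity with an LSA space reduces to the cosine between word vectors. The loading format and the toy vectors below are assumptions, not the released models' actual interface.

```python
# Word-to-word similarity as the cosine between rows of a precomputed
# LSA space. The word-to-vector mapping here is a hypothetical stand-in
# for vectors loaded from one of the released models.
import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

lsa = {"car": np.array([0.2, 0.7, 0.1]), "automobile": np.array([0.25, 0.65, 0.05])}
print(cosine(lsa["car"], lsa["automobile"]))
```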

2006

The paper presents a software system that embodies a lexico-syntactic approach to the task of Textual Entailment. Although the approach is based on a minimal set of resources, it yields high-confidence judgments. The architecture of the system is open and can be easily extended with more and deeper processing modules. Results on a standard data set are presented.
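The sketch below illustrates the general flavor of a lexical entailment check, judging entailment by how much of the hypothesis is covered by the text; it is a toy illustration of the idea, not the system described above, and the threshold is an assumption.

```python
# Toy lexical entailment check: the text entails the hypothesis if it
# covers enough of the hypothesis's tokens. A stand-in for the richer
# lexico-syntactic matching described above.
def entails(text, hypothesis, threshold=0.8):
    t = set(text.lower().split())
    h = set(hypothesis.lower().split())
    coverage = len(h & t) / len(h) if h else 0.0
    return coverage >= threshold

print(entails("The cat sat quietly on the mat.", "The cat sat on the mat."))
```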
