2017
DT_Team at SemEval-2017 Task 1: Semantic Similarity Using Alignments, Sentence-Level Embeddings and Gaussian Mixture Model Output
Nabin Maharjan | Rajendra Banjade | Dipesh Gautam | Lasang J. Tamang | Vasile Rus
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)
We describe our system (DT Team) submitted to the SemEval-2017 Task 1, Semantic Textual Similarity (STS) challenge for English (Track 5). We developed three different models with various features, including similarity scores calculated using word and chunk alignments, word/sentence embeddings, and a Gaussian Mixture Model (GMM). The correlation between our system's output and the human judgments was up to 0.8536, which is more than 10% above the baseline and almost as good as the best-performing system, which reached a correlation of 0.8547 (a difference of only about 0.1%). Our system also produced leading results when evaluated on a separate STS benchmark dataset. The features based on word alignment and sentence embeddings were found to be very effective.
2016
DTSim at SemEval-2016 Task 1: Semantic Similarity Model Including Multi-Level Alignment and Vector-Based Compositional Semantics
Rajendra Banjade | Nabin Maharjan | Dipesh Gautam | Vasile Rus
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)
DTSim at SemEval-2016 Task 2: Interpreting Similarity of Texts Based on Automated Chunking, Chunk Alignment and Semantic Relation Prediction
Rajendra Banjade | Nabin Maharjan | Nobal Bikram Niraula | Vasile Rus
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)
Evaluation Dataset (DT-Grade) and Word Weighting Approach towards Constructed Short Answers Assessment in Tutorial Dialogue Context
Rajendra Banjade | Nabin Maharjan | Nobal Bikram Niraula | Dipesh Gautam | Borhan Samei | Vasile Rus
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications
SemAligner: A Method and Tool for Aligning Chunks with Semantic Relation Types and Semantic Similarity Scores
Nabin Maharjan | Rajendra Banjade | Nobal Bikram Niraula | Vasile Rus
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper introduces a rule-based method and software tool, called SemAligner, for aligning chunks across the two texts in a given pair of short English texts. The tool is based on the top-performing method at the Interpretable Short Text Similarity shared task at SemEval 2015, where it was used with human-annotated (gold) chunks. It can now additionally process plain text pairs using two powerful chunkers we developed, e.g. using Conditional Random Fields. Besides aligning chunks, the tool automatically assigns semantic relations to the aligned chunks (such as EQUI for equivalent and OPPO for opposite) and semantic similarity scores that measure the strength of the semantic relation between the aligned chunks. Experiments show that SemAligner performs competitively on system-generated chunks and that these results are comparable to results obtained on gold chunks. SemAligner has other capabilities, such as handling various input formats and chunkers as well as extending lookup resources.
DT-Neg: Tutorial Dialogues Annotated for Negation Scope and Focus in Context
Rajendra Banjade | Vasile Rus
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Negation is often more frequent in dialogue than in commonly written texts, such as literary texts. Furthermore, the scope and focus of negation depend more on context in dialogue than in other forms of text. Existing negation datasets have focused on non-dialogue texts, such as literary texts, where the scope and focus of negation are normally present within the same sentence as the negation itself. Such datasets are therefore not the most appropriate to inform the development of negation handling algorithms for dialogue-based systems. In this paper, we present the DT-Neg corpus (DeepTutor Negation corpus), which contains texts extracted from tutorial dialogues where students interacted with an Intelligent Tutoring System (ITS) to solve conceptual physics problems. The DT-Neg corpus contains annotated negations in student responses, with scope and focus marked based on the context of the dialogue. Our dataset contains 1,088 instances and is available for research purposes at http://language.memphis.edu/dt-neg.
2015
NeRoSim: A System for Measuring and Interpreting Semantic Textual Similarity
Rajendra Banjade | Nobal Bikram Niraula | Nabin Maharjan | Vasile Rus | Dan Stefanescu | Mihai Lintean | Dipesh Gautam
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)
2014
On Paraphrase Identification Corpora
Vasile Rus | Rajendra Banjade | Mihai Lintean
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In this paper, we analyze a number of data sets proposed over the last decade or so for the task of paraphrase identification. The goal of the analysis is to identify the advantages as well as the shortcomings of the previously proposed data sets. Based on the analysis, we then make recommendations about how to improve the process of creating and using such data sets for evaluating future approaches to the task of paraphrase identification or the more general task of semantic similarity. The recommendations are meant to improve our understanding of what a paraphrase is, offer a fairer ground for comparing approaches, increase the diversity of actual linguistic phenomena that future data sets will cover, and offer ways to improve our understanding of the contributions of various modules or approaches proposed for solving the task of paraphrase identification or similar tasks.
The DARE Corpus: A Resource for Anaphora Resolution in Dialogue Based Intelligent Tutoring Systems
Nobal Niraula | Vasile Rus | Rajendra Banjade | Dan Stefanescu | William Baggett | Brent Morgan
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We describe the DARE corpus, an annotated data set focusing on pronoun resolution in tutorial dialogue. Although data sets for general purpose anaphora resolution exist, they are not suitable for dialogue based Intelligent Tutoring Systems. To the best of our knowledge, no data set is currently available for pronoun resolution in dialogue based intelligent tutoring systems. The described DARE corpus consists of 1,000 annotated pronoun instances collected from conversations between high-school students and the intelligent tutoring system DeepTutor. The data set is publicly available.
Latent Semantic Analysis Models on Wikipedia and TASA
Dan Ștefănescu | Rajendra Banjade | Vasile Rus
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper introduces a collection of freely available Latent Semantic Analysis (LSA) models built on the entire English Wikipedia and the TASA corpus. The models differ not only in their source, Wikipedia versus TASA, but also in the linguistic items they focus on: all words, content words, nouns and verbs, and main concepts. Generating such models from large datasets (e.g. Wikipedia), which can provide broad coverage of the actual vocabulary in use, is computationally challenging, which is why large LSA models are rarely available. Our experiments show that for the task of word-to-word similarity, the scores assigned by these models are strongly correlated with human judgment, outperforming many other frequently used measures and comparable to the state of the art.
2013
SEMILAR: The Semantic Similarity Toolkit
Vasile Rus | Mihai Lintean | Rajendra Banjade | Nobal Niraula | Dan Stefanescu
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations