We propose a compositional Bayesian semantics that interprets declarative sentences in a natural language by assigning them probability conditions. These are conditional probabilities that estimate the likelihood that a competent speaker would endorse an assertion, given certain hypotheses. Our semantics is implemented in a functional programming language. It estimates the marginal probability of a sentence through Markov Chain Monte Carlo (MCMC) sampling of objects in vector space models satisfying specified hypotheses. We apply our semantics to examples with several predicates and generalised quantifiers, including higher-order quantifiers. It captures the vagueness of predication (both gradable and non-gradable), without positing a precise boundary for classifier application. We present a basic account of semantic learning based on our semantic system. We compare our proposal to other current theories of probabilistic semantics, and we show that it offers several important advantages over these accounts.
Detecting Linguistic Traces of Depression in Topic-Restricted Text: Attending to Self-Stigmatized Depression with NLP
JT Wolohan | Misato Hiraga | Atreyee Mukherjee | Zeeshan Ali Sayyed | Matthew Millard
Natural language processing researchers have proven the ability of machine learning approaches to detect depression-related cues from language; however, to date, these efforts have primarily assumed it was acceptable to leave depression-related texts in the data. Our concerns with this are twofold: first, that the models may be overfitting on depression-related signals, which may not be present in all depressed users (only those who talk about depression on social media); and second, that these models would under-perform for users who are sensitive to the public stigma of depression. This study demonstrates the validity to those concerns. We construct a novel corpus of texts from 12,106 Reddit users and perform lexical and predictive analyses under two conditions: one where all text produced by the users is included and one where the depression data is withheld. We find significant differences in the language used by depressed users under the two conditions as well as a difference in the ability of machine learning algorithms to correctly detect depression. However, despite the lexical differences and reduced classification performance–each of which suggests that users may be able to fool algorithms by avoiding direct discussion of depression–a still respectable overall performance suggests lexical models are reasonably robust and well suited for a role in a diagnostic or monitoring capacity.
Arabic Broken Plurals show an interesting phenomenon in Arabic morphology as they are formed by shifting the consonants of the syllables into different syllable patterns, and subsequently, the pattern of the word changes. The present paper, therefore, attempts to look at Arabic broken plurals from the perspective of neural networks by implementing an OpenNMT experiment to better understand and interpret the behavior of these plurals, especially when it comes to L2 acquisition. The results show that the model is successful in predicting the Arabic template. However, it fails to predict certain consonants such as the emphatics and the gutturals. This reinforces the fact that these consonants or sounds are the most difficult for L2 learners to acquire.
Ever increasing ransomware attacks and thefts of intellectual property demand cybersecurity solutions to protect critical documents. One emerging solution is to place fake text documents in the repository of critical documents for deceiving and catching cyber attackers. We can generate fake text documents by obscuring the salient information in legit text documents. However, the obscuring process can result in linguistic inconsistencies, such as broken co-references and illogical flow of ideas across the sentences, which can discern the fake document and render it unbelievable. In this paper, we propose a novel method to generate believable fake text documents by automatically improving the linguistic consistency of computer-generated fake text. Our method focuses on enhancing syntactic cohesion and semantic coherence across discourse segments. We conduct experiments with human subjects to evaluate the effect of believability improvements in distinguishing legit texts from fake texts. Results show that the probability to distinguish legit texts from believable fake texts is consistently lower than from fake texts that have not been improved in believability. This indicates the effectiveness of our method in generating believable fake text.
The Winograd Schema Challenge targets pronominal anaphora resolution problems which require the application of cognitive inference in combination with world knowledge. These problems are easy to solve for humans but most difficult to solve for machines. Computational models that previously addressed this task rely on syntactic preprocessing and incorporation of external knowledge by manually crafted features. We address the Winograd Schema Challenge from a new perspective as a sequence ranking task, and design a Siamese neural sequence ranking model which performs significantly better than a random baseline, even when solely trained on sequences of words. We evaluate against a baseline and a state-of-the-art system on two data sets and show that anonymization of noun phrase candidates strongly helps our model to generalize.
Sentences with presuppositions are often treated as uninterpretable or unvalued (neither true nor false) if their presuppositions are not satisfied. However, there is an open question as to how this satisfaction is calculated. In some cases, determining whether a presupposition is satisfied is not a trivial task (or even a decidable one), yet native speakers are able to quickly and confidently identify instances of presupposition failure. I propose that this can be accounted for with a form of possible world semantics that encapsulates some reasoning abilities, but is limited in its computational power, thus circumventing the need to solve computationally difficult problems. This can be modeled using a variant of the framework of finite state semantics proposed by Rooth (2017). A few modifications to this system are necessary, including its extension into a three-valued logic to account for presupposition. Within this framework, the logic necessary to calculate presupposition satisfaction is readily available, but there is no risk of needing exceptional computational power. This correctly predicts that certain presuppositions will not be calculated intuitively, while others can be easily evaluated.
We explore the use of natural language processing and machine learning for detecting evidence of Parkinson’s disease from transcribed speech of subjects who are describing everyday tasks. Experiments reveal the difficulty of treating this as a binary classification task, and a multi-class approach yields superior results. We also show that these models can be used to predict cognitive abilities across all subjects.
This paper explores the correlations between key syntactic dependencies and the occurrence of simple spoken language disfluencies such as filled pauses and incomplete words. The working hypothesis here is that interruptions caused by these phenomena are more likely to happen between weakly connected words from a syntactic point of view than between strongly connected ones. The obtained results show significant patterns with the regard to key syntactic phenomena, like confirming the positive correlation between the frequency of disfluencies and multiples measures of syntactic complexity. In addition, they show that there is a stronger relationship between the verb and its subject than with its object, which confirms the idea of a hierarchical incrementality. Also, this work uncovered an interesting role played by a verb particle as a syntactic delimiter of some verb complements. Finally, the interruptions by disfluencies patterns show that verbs have a more privileged relationship with their preposition compared to the object Noun Phrase (NP).
Older adults tend to suffer a decline in some of their cognitive capabilities, being language one of least affected processes. Word association norms (WAN) also known as free word associations reflect word-word relations, the participant reads or hears a word and is asked to write or say the first word that comes to mind. Free word associations show how the organization of semantic memory remains almost unchanged with age. We have performed a WAN task with very small samples of older adults with Alzheimer’s disease (AD), vascular dementia (VaD) and mixed dementia (MxD), and also with a control group of typical aging adults, matched by age, sex and education. All of them are native speakers of Mexican Spanish. The results show, as expected, that Alzheimer disease has a very important impact in lexical retrieval, unlike vascular and mixed dementia. This suggests that linguistic tests elaborated from WAN can be also used for detecting AD at early stages.
In this paper, we discuss the development of a part-of-speech tagger for English-Assamese code-mixed texts. We provide a comparison of 2 approaches to annotating code-mixed data – a) annotation of the texts from the two languages using monolingual resources from each language and b) annotation of the text through a different resource created specifically for code-mixed data. We present a comparative study of the efforts required in each approach and the final performance of the system. Based on this, we argue that it might be a better approach to develop new technologies using code-mixed data instead of monolingual, ‘clean’ data, especially for those languages where we do not have significant tools and technologies available till now.