Baden Hughes


2006

While much effort is expended in the curation of language resources, such investment is largely irrelevant if users cannot locate resourcesof interest. The Open Language Archives Community (OLAC) was established to define standards for the description of language resources and providecore infrastructure for a virtual digital library, thus addressing the resource discovery issue. In this paper we consider naturalistic user search behaviour in the Open Language Archives Community. Specifically, we have collected the query logs from the OLAC Search Engine over a 2 year period, collecting in excess of 1.2 million queries, in over 450K user search sessions. Subsequently we have mined these to discover user search patterns of various types, all pertaining to the discovery of language resources.A number of interesting observations can be made based on this analysis, in this paper we report on a range of properties and behaviours based on empirical evidence.
The task of identifying the language in which a given document (ranging from a sentence to thousands of pages) is written has been relatively well studied over several decades. Automated approachesto written language identification are used widely throughout research and industrial contexts, over both oral and written source materials. Despite this widespread acceptance, a review of previous research in written language identification reveals a number of questions which remain openand ripe for further investigation.
In this paper we discuss the explicit representation of character features pertaining to written language resources, which we argue are critically necessary in the long term of archiving language data. Much focus on the creation of language resources and their associated preservation is at the level of the corpus itself; however it is generally accepted that long term interpretation of these language resources requires more than a best practice data format. In particular, where language resources are created in linguistic fieldwork, and especially for minority languages, the need for preservation not only of the resource itself, but of additional metadata which allows for the resource to be accurately interpreted in the future is becoming a topic of research in itself. In this paper we extend earlier work on semantically based character decomposition to include representation of character properties in a variety of models, and a mechanism for exploiting these properties through queries.

2005

2004

2003