Baden Hughes

2006

Searching for Language Resources on the Web: User Behaviour in the Open Language Archives Community
Baden Hughes
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

While much effort is expended in the curation of language resources, such investment is largely irrelevant if users cannot locate resources of interest. The Open Language Archives Community (OLAC) was established to define standards for the description of language resources and provide core infrastructure for a virtual digital library, thus addressing the resource discovery issue. In this paper we consider naturalistic user search behaviour in the Open Language Archives Community. Specifically, we have collected the query logs from the OLAC Search Engine over a 2 year period, collecting in excess of 1.2 million queries, in over 450K user search sessions. Subsequently we have mined these to discover user search patterns of various types, all pertaining to the discovery of language resources. A number of interesting observations can be made based on this analysis; in this paper we report on a range of properties and behaviours based on empirical evidence.

Reconsidering Language Identification for Written Language Resources
Baden Hughes | Timothy Baldwin | Steven Bird | Jeremy Nicholson | Andrew MacKinlay
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The task of identifying the language in which a given document (ranging from a sentence to thousands of pages) is written has been relatively well studied over several decades. Automated approaches to written language identification are used widely throughout research and industrial contexts, over both oral and written source materials. Despite this widespread acceptance, a review of previous research in written language identification reveals a number of questions which remain open and ripe for further investigation.

Feature-based Encoding and Querying Language Resources with Character Semantics
Baden Hughes | Dafydd Gibbon | Thorsten Trippel
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper we discuss the explicit representation of character features pertaining to written language resources, which we argue is critically necessary for the long-term archiving of language data. Much focus on the creation of language resources and their associated preservation is at the level of the corpus itself; however, it is generally accepted that long-term interpretation of these language resources requires more than a best practice data format. In particular, where language resources are created in linguistic fieldwork, and especially for minority languages, the need for preservation not only of the resource itself, but also of additional metadata which allows the resource to be accurately interpreted in the future, is becoming a topic of research in itself. In this paper we extend earlier work on semantically based character decomposition to include representation of character properties in a variety of models, and a mechanism for exploiting these properties through queries.

Frontiers in Linguistic Annotation for Lower-Density Languages
Mike Maxwell | Baden Hughes
Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora 2006
