Shota Wada
2005
Language and Encoding Scheme Identification of Extremely Large Sets of Multilingual Text
Pavol Zavarsky | Yoshiki Mikami | Shota Wada
Proceedings of Machine Translation Summit X: Posters
This paper outlines our approach to identifying languages and encoding schemes in extremely large sets of multilingual documents. The sets we analyze in our Language Observatory project [1] consist of tens of millions of text documents. Our approach analyzes about 250 documents per second (about 20 million documents/day) on a single Linux machine. Using multithreaded processing on a cluster of Linux servers, we can easily analyze more than 100 million documents/day.
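The throughput figures quoted in the abstract can be checked with a short calculation. The sketch below converts the stated rate of about 250 documents per second into a daily total, then estimates how many such machines a 100 million documents/day target would require. The function names are illustrative, and the machine count assumes throughput scales linearly with the number of servers, which the abstract does not state.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def docs_per_day(docs_per_second: float) -> float:
    """Daily throughput of one machine running continuously."""
    return docs_per_second * SECONDS_PER_DAY


def machines_needed(target_per_day: float, docs_per_second: float) -> int:
    """Machines required for a daily target.

    Assumption (not from the paper): throughput scales linearly
    with the number of machines, with no coordination overhead.
    """
    return math.ceil(target_per_day / docs_per_day(docs_per_second))


if __name__ == "__main__":
    # Rate quoted in the abstract: about 250 documents per second.
    print(docs_per_day(250))
    # Target quoted in the abstract: more than 100 million documents/day.
    print(machines_needed(100_000_000, 250))
```

At 250 documents per second a single machine processes 21.6 million documents per day, consistent with the "about 20 million" in the abstract. Under the linear-scaling assumption, five such machines would exceed 100 million documents per day.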