Constructing a language model based on data mining techniques for a Chinese character recognition system

by Chen, Yong

Abstract (Summary)
Abstract of thesis entitled "Constructing a Language Model Based on Data Mining Techniques for a Chinese Character Recognition System", submitted by Chen, Yong for the degree of Doctor of Philosophy at The University of Hong Kong in November 2004.

Language models are commonly used in postprocessing algorithms of recognition systems to improve the recognition rate. In postprocessing, a language model serves as a knowledge base to assist the recognition system in making a better decision. N-gram models, widely used in recognition systems, reflect the relationship between the current word and the (N-1) immediately preceding words. Due to data sparseness, only a few of the preceding words can be observed in practice. As a result, an N-gram model can only capture short distance dependencies between words.

The trigger pair language model is able to capture long distance dependencies between words. However, the trigger pair usually has only a single word for the trigger and the triggered word. In this study, we propose a new language model which consists of word sequences of more than two words. Each word sequence can generate a number of multiple word trigger pairs. The mutual information values of these multiple word trigger pairs are investigated, and are used to find a best path within a word lattice.

More importantly, based on these word sequences and the fact that Chinese words are mostly composed of multiple characters, we propose a new postprocessing strategy: word suggesting. When using a language model with a recognizer, the top n candidate recognition rate becomes the upper bound of the final recognition rate. The word suggesting strategy can enhance the recognition rate, and even help the recognizer to achieve a recognition rate greater than the upper bound. It is very common for a recognizer dealing with a multiple-character Chinese word to only partially recognize the word, because it recognizes some constituent characters correctly but not others.
The correctly recognized characters, however, can be good hints for recovering the word. This is consistent with people's experience in reading Chinese text: when some characters in a word are mistyped or get blurred, people can often infer the word by referring to the surrounding characters. However, sometimes using only those correct constituent characters is not enough for people to determine the correct word, since there may be more than one word in a language sharing these correctly identified characters. That is, there is more than one possible choice. In fact, people usually refer to the context when selecting an appropriate word from the possible choices. By "appropriate" we mean that the selected word is consistent with the context. The words referred to are usually highly associated with the guessed word. Another point is that these words may precede, follow or surround the guessed word. This provides excellent opportunities for people to correct a word by referring to its neighboring words.

Our proposed language model can help Chinese character recognizers to simulate human word correcting behavior as described above. The approach can help the recognizer to achieve a recognition rate greater than the top n candidate recognition rate, which is an upper bound for previous postprocessing approaches. We use data mining techniques to construct the model.
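
The word suggesting idea described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the function names (`pmi_table`, `suggest_word`), the toy corpus, the use of sentence-level co-occurrence, and the choice to score unseen pairs as 0.0 are all assumptions made for illustration. It estimates pointwise mutual information between words that co-occur in a sentence, then proposes a lexicon word that shares at least one character with the recognizer's candidates and is most strongly associated with the context, even when the correct character is absent from the candidate list.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus of pre-segmented sentences (illustration only).
TOY_CORPUS = [
    ["我", "喜欢", "喝", "咖啡"],
    ["他", "喝", "咖啡"],
    ["咖喱", "很", "辣"],
    ["她", "吃", "咖喱"],
]


def pmi_table(sentences):
    """Estimate pointwise mutual information for word pairs that
    co-occur in the same sentence (an assumed association measure)."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        unique_words = sorted(set(sentence))
        word_counts.update(unique_words)
        pair_counts.update(frozenset(p) for p in combinations(unique_words, 2))
    n = len(sentences)
    table = {}
    for pair, count in pair_counts.items():
        a, b = tuple(pair)
        p_ab = count / n
        p_a = word_counts[a] / n
        p_b = word_counts[b] / n
        table[pair] = math.log(p_ab / (p_a * p_b))
    return table


def suggest_word(candidates, lexicon, context, pmi):
    """Suggest the lexicon word most associated with the context.

    candidates: one set of candidate characters per character position,
                as a recognizer's top-n list might provide.
    lexicon:    known multi-character words.
    context:    neighboring words (before or after the uncertain word).
    Returns the best word, or None if no lexicon word shares a character
    with the candidates at a matching position.
    """
    best_word, best_score = None, float("-inf")
    for word in lexicon:
        if len(word) != len(candidates):
            continue
        # Require at least one position where the word agrees with a candidate.
        if not any(ch in cands for ch, cands in zip(word, candidates)):
            continue
        # Unseen pairs contribute 0.0 (a simplifying assumption).
        score = sum(pmi.get(frozenset((word, c)), 0.0) for c in context)
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```

For example, if the recognizer's candidates are `[{"咖"}, {"非", "排"}]`, the correct second character "啡" is missing from the candidate list, so no path through the candidates can yield the right word. Calling `suggest_word` with the lexicon `["咖喱", "咖啡"]` and the context `["喝"]` still proposes "咖啡", because it shares the first character and is the word associated with "喝" in the toy corpus.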
Bibliographical Information:


School: The University of Hong Kong

School Location: China - Hong Kong SAR

Source Type: Master's Thesis

Keywords: Chinese character sets data processing optical mining recognition devices


Date of Publication: 01/01/2005
