Speaker-independent recognition of Putonghua finals
of thesis entitled
Speaker- Independent Recognition of Putonghua Finals
CHAN, Chit Man
for the degree of Doctor of Philosophy
at the University of Hong Kong
A detailed study had been performed to address the problem of speaker-independent recognition of Putonghua (Mandarin) finals. The study included 35 Putonghua finals, 16 of which having trailing nasals. They were spoken by 51 speakers: 38 females, 13 males, in 5 different tones for two times. The sample was spectrally analyzed by a bank of 18 nonoverlapping critical-band filters. Three data reduction techniques:
Karhunen-Loeve Transformation (KLT) , Discrete Cosine Transformation (OCT) and Stepwise Discriminant Analysis (SDA) , were comparat i vely studied for their feature representation capability. The results indicated that KLT was superior to both OCT and SDA. Furthermore, the theoretic equivalence of OCT to KLT was found to be valid only with 5 or more feature dimensions used in computation. On the other hand, the results also showed that the Hahalanobis and a proposed modified Mahalanobis distance both gave a better measurement of performance than the other distances tested, which included the City Block, Euclidean, Minkowski, and Chebyshev.
In the second Part of the study, the Hidden Markov Modelling (HMM) technique was investigated. Three classification methods: Phonemic Labell ing (PL), Vector Quantization (VQ) and a proposed Hybrid Symbol (HS) generation, were studied for use with HMM. Whilst PL was found to be simple and efficient, its performance was not as good as VQ. However, the time taken by VQ was excessive, especially in training. The results with the HS method showed that it .could successfully merge the speed advantage of PL and the better discriminatory power of VQ. An approximately 80% saving in the quantizer training time could be achieved with only a marginal loss in performance. At the same time, it
was also found that allowing skipping of states in a Left-to-Right model (LRM) could lead to a negative effect on overall recognition.
As an indication of performance, the recognition rate of the simulated system was 81.3%, 95.0% and 98.0% with the best I, 2, and 3 candidates included, respectively, using a 256-level VQ and a 6-state, no-skip LRM on a sample of 8,400 finals from 48 speakers. The specific rates on non-nasal finals achieved even 96% - 98% using the best candidate alone .
School:The University of Hong Kong
School Location:China - Hong Kong SAR
Source Type:Master's Thesis
Keywords:mandarin dialects phonetics automatic speech recognition
Date of Publication:01/01/1988