Using Prosody in Automatic Segmentation of Speech

Ossama Essa
Computer Science Department
University of South Carolina
Columbia, SC 29208
essa@cs.sc.edu

Abstract- Automatic speech segmentation is an essential tool for building large corpora for training speech recognition systems. Manual segmentation of speech is both time-consuming and error-prone. Several automatic segmentation systems have been proposed based on the acoustical features of the speech [5] [9]. In rhythmic speech, prosodic features become an essential factor in designing an accurate automatic segmentation system. This work presents a novel technique for automatic segmentation of speech in which both prosodic and acoustical features of the speech are examined to achieve a higher segmentation accuracy. The system was tested on Koranic Arabic, a highly rhythmic language. This paper shows that incorporating the prosodic features in the design resulted in better segmentation accuracy for rhythmic speech.

1 Introduction

This work presents a method for incorporating the prosodic features inherent in Koranic Arabic as part of an automatic speech segmentation system. Koran recitation is best described as long, slow-paced, rhythmic, monotone utterances [7]. The sound of Koranic recitation is recognizably unique and reproducible according to a set of pronunciation rules, tajweed, designed for clear and accurate presentation of the text. The input to the system is the speech signal and the phonetic transcription of the speech utterance. The segmentation task is to detect the boundaries of each phoneme within the speech signal. The system's block diagram is shown in figure 1.

The speech signal is first passed through the preprocessing module, where speech endpoints are detected. The algorithm developed by Rabiner and Sambur [8] for detection of speech endpoints was used. The algorithm uses both energy and zero-crossing thresholds to detect the beginning and end of the speech. In many examples, initial and final fricatives were indistinguishable from background noise. In our case, since the phonetic transcription of the speech utterance is known beforehand, a more accurate endpoint detection was achieved by lowering the energy threshold in the cases of initial and final fricatives. The signal is then filtered using a low-pass filter [0-2000 Hz] to eliminate high-frequency components. No spectral analysis of the speech signal is needed, since the system processes only the time-domain signal (speech waveform).

[Block diagram. Labels include: Speech Signal, Preprocess, Filtered Speech Signal, Detect Speech Endpoints, Detect Voiced Regions, Phonetic Transcription, Phonetic Estimate, Phonetic Segmentation, Labeled Speech Segments.]
Figure 1: Automatic Speech Segmentation
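The preprocessing stage described above can be illustrated with a minimal sketch. It is not the Rabiner and Sambur algorithm as published, only a simplified energy and zero-crossing detector in the same spirit, followed by a time-domain windowed-sinc low-pass filter. The function names, the frame length, the threshold values, the `fricative_scale` factor, and the `max_extend` limit are all assumptions made for illustration; the text gives only the 0-2000 Hz pass band and the idea of lowering the energy threshold for initial and final fricatives.

```python
import numpy as np


def frame_features(signal, frame_len):
    """Return per-frame short-time energy and zero-crossing counts."""
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))
    energy = np.sum(frames ** 2, axis=1)
    signs = np.signbit(frames)
    zero_crossings = np.sum(signs[:, 1:] != signs[:, :-1], axis=1)
    return energy, zero_crossings


def detect_endpoints(signal, frame_len, energy_thresh, zc_thresh,
                     initial_fricative=False, final_fricative=False,
                     fricative_scale=0.3, max_extend=25):
    """Return (start_sample, end_sample) of the speech, or None if none found.

    The energy threshold locates a first guess at each end. Each end is then
    pushed outward over adjacent frames whose zero-crossing count exceeds
    zc_thresh. If the transcription says the utterance begins or ends with a
    fricative, the energy threshold at that end is scaled by fricative_scale
    (an assumed value).
    """
    energy, zc = frame_features(signal, frame_len)
    start_thresh = energy_thresh * (fricative_scale if initial_fricative else 1.0)
    end_thresh = energy_thresh * (fricative_scale if final_fricative else 1.0)
    above_start = np.flatnonzero(energy > start_thresh)
    above_end = np.flatnonzero(energy > end_thresh)
    if above_start.size == 0 or above_end.size == 0:
        return None
    first, last = int(above_start[0]), int(above_end[-1])

    steps = 0
    while first > 0 and steps < max_extend and zc[first - 1] > zc_thresh:
        first -= 1
        steps += 1
    steps = 0
    while last < len(zc) - 1 and steps < max_extend and zc[last + 1] > zc_thresh:
        last += 1
        steps += 1
    return first * frame_len, (last + 1) * frame_len


def lowpass_fir(signal, sample_rate, cutoff_hz=2000.0, num_taps=101):
    """Low-pass filter in the time domain with a Hamming-windowed sinc FIR."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    fc = cutoff_hz / sample_rate  # cutoff in cycles per sample
    taps = 2 * fc * np.sinc(2 * fc * n) * np.hamming(num_taps)
    taps /= taps.sum()  # unity gain at DC
    return np.convolve(signal, taps, mode="same")
```

Setting `initial_fricative` or `final_fricative` from the known transcription is what lets a weak fricative at the edge of the utterance survive a threshold that would otherwise discard it as background noise.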
2 Phonetic Estimate

The basic Arabic phonemes are shown in table 1 [1] [3]. There are three vowels in the Arabic language (/a/, /u/, and /i/), and each vowel can be either short or long. The allowed syllables in Arabic are CV, CVV, CVC, CVVC, CVCC, and CVVCC, where VV indicates a long vowel [6]. CVCC and CVVCC can occur only in the final position. Under these restricted syllable structures, no more than two consecutive consonants are allowed. Diphthongs (vowel successions) do not occur in
Table 1: The basic Arabic phonemes

Sym.   Description                               (unlabeled)
/a/    Low center unrounded vowel                3
/u/    High back rounded vowel                   3
/i/    High front unrounded vowel                3
/'/    Voiced glottal stop                       2
/b/    Voiced bilabial stop                      2
/t/    Unvoiced dental stop                      1
/th/   Unvoiced inter-dental fricative           1
/j/    Voiced dental fricative                   2
/H/    Unvoiced pharyngeal fricative             1
/x/    Unvoiced velar fricative                  1
/d/    Voiced dental stop                        2
/Th/   Voiced inter-dental fricative             2
/r/    Voiced dental trill                       3
/z/    Voiced dental fricative                   2
/s/    Unvoiced dental fricative                 1
/sh/   Unvoiced palatal fricative                1
/S/    Unvoiced velarized dental fricative       1
/D/    Voiced velarized palatal stop             2
/T/    Unvoiced velarized palatal stop           1
/Z/    Voiced velarized interdental fricative    2
/?/    Voiced pharyngeal fricative               3
/gh/   Voiced uvular fricative                   2
/f/    Unvoiced labiodental fricative            1
/q/    Unvoiced uvular stop                      1
/k/    Unvoiced velar stop                       1
/l/    Voiced dental sonorant                    3
/m/    Voiced bilabial nasal                     3
/n/    Voiced dental nasal                       3
/hu/   Unvoiced glottal fricative                1
/hv/   Voiced glottal fricative                  3
/w/    Voiced bilabial sonorant                  3
/y/    Voiced palatal sonorant                   3
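The syllable constraints stated at the start of this section (the six admissible patterns, with CVCC and CVVCC restricted to final position) can be checked mechanically. The following is a minimal sketch, assuming the utterance has already been reduced to a string of `C` and `V` characters with a long vowel written as `VV`; the function name `syllabify` and this input encoding are illustrative choices, not part of the system described here.

```python
import re

# Patterns that may appear only as the last syllable.
FINAL_ONLY = {"CVCC", "CVVCC"}

# One onset consonant, a short (V) or long (VV) vowel, then up to two coda
# consonants. A consonant followed by a vowel is left for the next syllable's
# onset, so the only matches are CV, CVV, CVC, CVVC, CVCC, and CVVCC.
_SYLLABLE = re.compile(r"CVV?(?:C(?!V)){0,2}")


def syllabify(pattern):
    """Split a C/V string into admissible syllables, or return None."""
    syllables = _SYLLABLE.findall(pattern)
    if "".join(syllables) != pattern:
        return None  # some characters could not be placed in any syllable
    for i, syllable in enumerate(syllables):
        if syllable in FINAL_ONLY and i != len(syllables) - 1:
            return None  # CVCC / CVVCC used in a non-final position
    return syllables
```

A string with three consecutive consonants, such as `CVCCCV`, is rejected because it would force a CVCC syllable into a non-final position, which matches the rule that no more than two consecutive consonants are allowed.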
eliminate speaker dependency, each phoneme length is measured in relative time units, Ut, instead of absolute time. Each phoneme is assigned an integer number of relative time units (Ut). The speaker changes the speed, without affecting the rhythm, by changing the value of Ut. In long utterances, the value of Ut can vary throughout the utterance, but in short utterances Ut was more consistent. Short vowels were used as the basis for calculating the lengths of the other phonemes. Each short vowel was assigned three Ut's, and the relative lengths of the other phonemes were measured accordingly. Table 2 shows a partial list of phoneme lengths, in relative time units, where empty entries represent inadmissible contexts.
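As an illustration of how relative time units might be turned into absolute times, the sketch below spreads the duration between two detected endpoints over a phoneme sequence in proportion to each phoneme's Ut count. It assumes a single constant Ut across the utterance, which the text suggests is reasonable only for short utterances. The function name `estimate_boundaries` is hypothetical, and in the usage note only the value 3 for a short vowel comes from the text; the other lengths are placeholders, not entries from table 2.

```python
def estimate_boundaries(rel_lengths, start_time, end_time):
    """Place phoneme boundaries in proportion to relative lengths.

    rel_lengths: integer Ut counts, one per phoneme, in utterance order.
    start_time, end_time: utterance endpoints in seconds.
    Returns (ut_seconds, boundaries), where boundaries has one more entry
    than rel_lengths and runs from start_time to end_time.
    """
    total_units = sum(rel_lengths)
    if total_units <= 0:
        raise ValueError("relative lengths must sum to a positive value")
    ut_seconds = (end_time - start_time) / total_units
    boundaries = [start_time]
    for units in rel_lengths:
        boundaries.append(boundaries[-1] + units * ut_seconds)
    return ut_seconds, boundaries
```

For example, `estimate_boundaries([2, 3, 1, 3], 0.0, 0.9)` treats the second and fourth phonemes as short vowels of three Ut each and yields a Ut of about 0.1 s. For a long utterance, the same function could be applied piecewise to shorter stretches so that Ut is allowed to drift.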