Proceedings of the KSPS conference (대한음성학회:학술대회논문집)
The Korean Society Of Phonetic Sciences And Speech Technology
- Semi Annual
Domain
- Linguistics > Linguistics, General
2002.11a
-
This paper discusses what should be taken into consideration with respect to segmentation and labeling in the creation of a speech corpus: what levels of annotation and what kinds of content should be included, what acoustic information is checked for in segmentation, and so on.
-
As the telematics service, which integrates information technology, approaches commercialization, the necessity and importance of speech technology are growing rapidly. Speech technology occupies an important position in the telematics service because it announces the start of the service and delivers the retrieved results. The service must provide highly accurate speech recognition and natural synthesis of human speech in a driving environment, and this is especially true for fee-based services. High-quality TTS requires a speech synthesis technique that builds an optimal synthesis database and uses it efficiently. In this paper, we describe the design of the phonetically balanced sentences used for the speech database, the selection of a speaker suited to the service, methods for extracting accurate phoneme boundaries, and the factors taken into consideration in the prosody extraction stage. Finally, we present a real case that has been commercially deployed.
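The abstract mentions designing phonetically balanced sentences for the synthesis database but does not describe the selection procedure. Below is a minimal sketch of one common approach, greedy selection by diphone coverage; the function names and the coverage criterion are illustrative assumptions, not the paper's actual method.

# Hypothetical greedy selection of phonetically balanced sentences.
def diphones(phone_seq):
    # set of adjacent phone pairs in one sentence
    return set(zip(phone_seq, phone_seq[1:]))

def select_balanced(candidates, target_size):
    """candidates: list of (sentence, phone_sequence) pairs."""
    covered, selected = set(), []
    pool = list(candidates)
    while pool and len(selected) < target_size:
        # pick the sentence that adds the most unseen diphones
        best = max(pool, key=lambda c: len(diphones(c[1]) - covered))
        gain = diphones(best[1]) - covered
        if not gain:
            break
        covered |= gain
        selected.append(best[0])
        pool.remove(best)
    return selected, covered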
-
This paper gives an overview of an evaluation scheme for the intelligibility and naturalness of Korean speech synthesizers and summarizes the results of applying the scheme to two different Korean synthesizers. Based on these evaluation results, actual examples of speech quality improvement are introduced, and directions for future performance improvement of Korean synthesizers are proposed.
-
This paper concerns improving the speech recognition rate with an enhanced pronunciation dictionary. Modern large-vocabulary continuous speech recognition systems rely on pronunciation dictionaries. A pronunciation dictionary provides the pronunciation of each word in the vocabulary in phonemic units, which are modeled in detail by the acoustic models. In most speech recognition systems based on hidden Markov models, however, actual pronunciation variations are disregarded. Without pronunciation variations in the recognition system, the phonetic transcriptions in the dictionary do not match the actual realizations in the database. In this paper, we add an unvoicing rule for semivowels to the allophone rules of the pronunciation dictionary. Experimental results show higher recognition performance than with existing pronunciation dictionaries.
-
This paper introduces 'Dr. Speaking', an English pronunciation tutoring system recently developed by Eonon Inc. The system has three distinguishing features. First, it teaches how a speaker should configure the vocal organs to pronounce accurately. Second, after comparing a speaker's pronunciation with a native speaker's, it grades the speaker's pronunciation level according to phonetic standards. Third, it provides the information needed to correct a speaker's incorrect pronunciation. It is not easy for a tutoring system to perform all three of these tasks almost simultaneously, but 'Dr. Speaking' shows that it is possible by combining speech technology (e.g. speech recognition) with phonetic knowledge.
-
This article discusses the teaching of listening and reading skills through enhancing the awareness of pronunciation. First, it examines the problems which take place in listening comprehension, and seeks the ways in which we can teach the skill rather than simply practise it. The approaches proposed are based on micro-listening exercises which practise individual subskills of listening, especially by using the cloze test and tracking. The issue of using authentic materials is then examined for teaching recognition of the features of natural speech. Finally, it is argued that classroom activities need to take account of the true nature of real-life L2 listening.
-
The purpose of this paper is to establish standards for selecting musical materials for teaching English pronunciation in elementary school. For this purpose, 110 songs from Wee Sing and the elementary school curricula for English and Music were analyzed, and the results formed the basis for establishing the standards. Four standards were established, and each standard was divided into several steps according to degree of complexity. Finally, the degree of complexity of two songs was worked out to show the possibility of applying the standards.
-
This study concerns the constraints of English poetic meter. In English poems, the metrical pattern does not always match the linguistic stress on the lines, and these mismatches differ among the poets. For lexical stress mismatched with a weak metrical position, *W ⇒ Strength is established on the basis of the concept of the strong syllable. The peaks of monosyllabic words mismatched with a weak metrical position are divided according to which boundary of a phonological domain they are adjacent to: Adjacency Constraint I is proposed for a mismatched peak adjacent to the left boundary of a phonological domain, and *Peak] and Adjacency Constraint II for a mismatched peak adjacent to the right boundary. These constraints vary across the poets (Pope, Milton and Shakespeare): *[Peak [-stress], W ⇒ *Strength and *Peak] in Pope; *[+stress][Peak [-stress] and *Peak] in Milton; *[+stress][Peak [-stress], W ⇒ *Strength and AC II in Shakespeare.
-
The purpose of this paper is to investigate the Korean boundary tones of different sentence types and listeners' perception of the speaker's attitude according to speech rate, using three rates. According to previous studies, the meaning of Korean intonation is determined by the boundary tone. In my experimental results as well, each sentence type has a preferred boundary tone; however, the boundary tone of a sentence type is not affected by speech rate. The change of speech rate across the three patterns does influence listeners' perceptual responses, while the relationship between the pitch contour of the boundary tone and speech rate is not significant.
-
This study proposes the phonation type index k as a descriptor of the overall spectral tilt which is free from the effects of fundamental frequency and vowel quality. The newly proposed phonation type index k provides a single, simple measure of overall spectral tilt. It can be applied in speech technology and can also be used in diagnosing patients' voice qualities in speech pathology. The distribution of the phonation type index k, which is speaker-dependent, may be useful in forensic phonetics and voice recognition as an indicator of speaker identity.
-
We present a statistical analysis of Korean phonological variations using automatic generation of phonetic transcriptions. We constructed an automatic generation system for Korean pronunciation variants by applying rules that model obligatory and optional phonemic changes and allophonic changes. These rules are derived from knowledge-based morphophonological analysis and the government standard pronunciation rules. The system is optimized for continuous speech recognition: it generates phonetic transcriptions for training and constructs a pronunciation dictionary for recognition. In this paper, we describe Korean phonological variations by analyzing the statistics of phonemic change rule applications over the 60,000 sentences of the Samsung PBS (Phonetic Balanced Sentence) Speech DB. Our results show that the most frequent obligatory phonemic variations are, in order, liaison, tensification, aspiration, and nasalization of obstruents, and that the most frequent optional phonemic variations are, in order, deletion of the initial consonant /h/, insertion of a final consonant with the same place of articulation as the next consonant, and deletion of a final consonant with the same place of articulation as the next consonant. These statistics can be used to improve the performance of speech recognition systems.
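As a rough illustration of how rule applications can be counted during transcription generation, the sketch below applies rewrite rules and tallies each application. The two rules are toy, romanized stand-ins for the obligatory and optional rules analyzed in the paper; they are not the actual Korean rule set.

# Illustrative sketch only: counting applications of phonemic change rules.
import re
from collections import Counter

RULES = [
    ("nasalization_of_obstruent", re.compile(r"k(?= ?n)"), "ng"),
    ("initial_h_deletion",        re.compile(r"(?<=[aeiou] )h"), ""),
]

def apply_rules(transcription, stats):
    for name, pattern, repl in RULES:
        transcription, n = pattern.subn(repl, transcription)
        stats[name] += n
    return transcription

stats = Counter()
for sent in ["hak nyeon", "cho ha da"]:   # toy romanized inputs
    print(apply_rules(sent, stats))
print(stats.most_common())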
-
According to Pierrehumbert (1980), two level tones, H and L, are enough to represent the intonation of intonational languages. In Korean, however, the high fall and low fall boundary tones, both of which must be represented as HL% in intonational phonology as in Jun (1993, 1999), are distinct not only acoustically but also functionally. The same is true of the high level and mid level boundary tones, which must both be represented as H% in intonational phonology. In this paper, I conducted two identification tests to provide crucial evidence that H and L are not enough in intonational phonology. The results of the identification tests show that categorical perception occurs between high level and mid level as well as between high fall and low fall. Based on this fact and the results of the acoustic analyses in Lee (1999, 2000), I strongly propose adopting one more level tone, M, to represent Korean boundary tones.
-
This research studied the correlation of word frequency effects across Korean corpora. Experiment 1 showed that the word frequencies of the different corpora were significantly correlated with one another. Experiment 2 showed significant correlations between the word frequency of each corpus and participants' lexical decision times. These results suggest that the four corpora in this research are stable with respect to the word frequency effect.
-
The purpose of this paper is to investigate the unit of the neighborhood of Korean words. In English, a word's orthographic neighborhood is defined as the set of words that can be created by changing one letter of the word while preserving letter positions. For example, the words pike, pole, and tile are all orthographic neighbors of the word pile. In this study, two experiments were performed with four prime conditions: primes sharing (1) the first letter of the first syllable with the target, (2) the first syllable, (3) the first syllable and the first letter of the second syllable, and (4) no formal similarity with the target. In Experiment 1, reaction time was shortest in condition 3; in Experiment 2, condition 2 had the shortest reaction time. We conclude that in Korean, a word's neighbors are words that share at least one syllable with it.
-
The purpose of this study was to investigate which model among the full-list, decomposition, and hybrid models is appropriate for explaining the processing of Korean verbs, particularly the tense prefinal ending, the connective ending, and the morphological passive affix. Three experiments were performed. The results of Experiments 1, 2, and 3 suggest that a new model of Korean verb processing is necessary.
-
In the present study, the intelligibility of synthesized speech was evaluated using psycholinguistic and fMRI techniques. In order to examine differences in word recognition between natural and synthesized speech, word regularity and word frequency were varied. The results of Experiments 1 and 2 showed that the intelligibility difference of the synthesized speech comes from word regularity: there was weaker activation of the auditory areas of the brain and slower recognition times for the regular words.
-
Lee ChiGeun; Lee EunSuk; Lee HaeJung; Kim BongWan; Joung SukTae; Jung SungTae; Lee YongJoo; Han MoonSung
Complementary use of several modalities in human-to-human communication ensures high accuracy, and only a few communication problems occur. Multimodal interfaces are therefore considered the next-generation interface between human and computer. This paper presents the current status and research themes of speech-based multimodal interface technology. It first introduces the concept of a multimodal interface, then surveys recognition technologies for input modalities and synthesis technologies for output modalities, followed by modality integration technology. Finally, it presents research themes for speech-based multimodal interface technology.
-
Neural networks are known to have great discriminative power in pattern classification problems. In this paper, multilayer perceptron neural networks are employed to automatically detect laryngeal pathology in speech. In addition, new feature parameters are introduced that reflect the periodicity of speech and its perturbation. These parameters and cepstral coefficients are used as inputs to the multilayer perceptron. In experiments on a Korean disordered speech database, incorporating the new parameters together with the cepstral coefficients outperforms using cepstral coefficients alone.
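A minimal sketch of the classification stage, assuming the cepstral coefficients and the new periodicity/perturbation parameters have already been extracted per utterance and stored in the hypothetical files named below; the scikit-learn MLP stands in for the paper's multilayer perceptron.

# Minimal MLP pathology-detection sketch on precomputed features.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X = np.load("features.npy")   # (n_utterances, n_cepstra + n_new_params), assumed layout
y = np.load("labels.npy")     # 0 = normal, 1 = pathological (assumed labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X_tr, y_tr)
print("detection accuracy:", clf.score(X_te, y_te))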
-
This paper focuses on building a database of commercial stocks using XML syntax and examines a way of building a system, combining XML and XSLT, that provides connectivity to client-server databases through vocal means. The use of XSLT has several advantages; most importantly, it can transform one type of data into different formats. A vocal interface removes some of the space and time limits imposed on users outside the premises when they need an instant connection to their database, so users can check information on stock lists without being constrained by such limits. PCs, PDAs and cellular phones are some examples of mobile connection. VoiceXML is used to create the vocal applications: in VoiceXML services, users gain immediate access to data through their voice input and the DTMF signals of the telephone.
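To illustrate the XML + XSLT idea, the sketch below transforms a tiny stock document into a prompt-oriented format with lxml. The element names and the stylesheet are invented placeholders, not the paper's actual schema or stylesheet.

# Hedged illustration of transforming stock XML with XSLT (lxml).
from lxml import etree

stock_xml = etree.XML(
    "<stocks><stock><name>ACME</name><price>1200</price></stock></stocks>")
stylesheet = etree.XML("""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/stocks">
    <prompts>
      <xsl:for-each select="stock">
        <prompt><xsl:value-of select="name"/> is trading at
          <xsl:value-of select="price"/></prompt>
      </xsl:for-each>
    </prompts>
  </xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(stylesheet)
print(etree.tostring(transform(stock_xml), pretty_print=True).decode())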
-
In this paper, we describe our LVCSR system for Korean broadcast news transcription. The main focus is to find the most appropriate morpheme-based lexical model for Korean broadcast news recognition, in order to deal with the inflectional flexibility of Korean. There are trade-offs between lexicon size and lexical coverage, and between the length of the lexical unit and WER. In our system, we analyzed the training corpus to obtain a small 24k-morpheme lexicon with 98.8% coverage. The lexicon is then optimized by combining morphemes using training-corpus statistics under a monosyllable constraint or a maximum-length constraint. In experiments, our system reduced the proportion of monosyllabic morphemes in the lexicon from 52% to 29% and obtained a WER of 13.24% for anchors and 24.97% for reporters.
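A sketch of one plausible way to combine morphemes under such constraints: merge frequent adjacent morpheme pairs when one member is monosyllabic and the merged unit stays short. The syllable_count() placeholder and the thresholds are illustrative assumptions, not the paper's exact procedure.

# Hedged sketch: one merging pass over a morpheme-segmented corpus.
from collections import Counter

def syllable_count(morph):            # assumption: one Hangul syllable per character
    return len(morph)

def merge_pass(corpus, min_count=100, max_syllables=4):
    pair_freq = Counter()
    for sent in corpus:               # corpus: list of morpheme lists
        pair_freq.update(zip(sent, sent[1:]))
    merges = {
        pair for pair, n in pair_freq.items()
        if n >= min_count
        and (syllable_count(pair[0]) == 1 or syllable_count(pair[1]) == 1)
        and syllable_count(pair[0]) + syllable_count(pair[1]) <= max_syllables
    }
    merged_corpus = []
    for sent in corpus:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in merges:
                out.append(sent[i] + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged_corpus.append(out)
    return merged_corpus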
-
An aerodynamic analysis was performed on 14 normal subjects (2 male, 12 female) using nonsense syllables composed of the Korean bilabial stops /p, p', pʰ/ and the preceding and/or following vowels /i, a, u/, that is, [pi, p'i, pʰi, pa, p'a, pʰa, pu, p'u, pʰu]. All measures were analysed using the Aerophone II voice function analyzer and included peak air pressure, mean air pressure, maximum flow rate, volume, and mean SPL. The results show, first, that the MSPL and MAP of /p, p', pʰ/ themselves were significantly different. In addition, different vowel environments also produced significantly different aerodynamic characteristics for these consonants.
-
The purpose of this study was to determine the correlation between the Average Fundamental Frequency, Fo-Tremor Frequency, Jitter, Shimmer, Amplitude Tremor Intensity Index, and Noise to Harmonic Ratio of MDVP and Fo, Fo Tremor, Jitter, Shimmer, Amp Tremor, HNR, and NNE of Dr. Speech. The Pearson correlation coefficient was used for analysis. The results showed that there was a strong correlation between Fo and Shimmer of both instruments. However, the remaining parameters did not show a significant correlation.
-
This study examined Korean word length effects in auditory word recognition. Linguistically, word length can be defined by several sublexical units such as letters, phonemes, and syllables. In order to investigate which units are used in auditory word recognition, a lexical decision task was used. Experiments 1 and 2 showed that syllable length affected response time and interacted with word frequency. Thus, syllable length is an important variable in auditory word recognition.
-
The relation between word intelligibility and sentence intelligibility was tested on adults with cerebral palsy (athetoid type). Intelligibility is an important measure in the diagnosis and therapy of dysarthric patients. In order to develop a one-syllable phonetic contrast intelligibility test using specific phonetic contrasts, its correlation with sentence intelligibility was tested to establish validity. Pearson's simple correlation coefficient was .83, indicating a high correlation. Also, a comparison of the range and standard deviation of the scores given by seven evaluators for each subject showed that, when evaluating patients of moderate intelligibility, word intelligibility was more reliable than sentence intelligibility.
-
This study was performed to find changes in acoustic measurements of the voice after eating egg, apple, and pear. Ten college students vocalized /a/ before and after eating egg, apple, and pear, and Dr. Speech was used to obtain the subjects' acoustic measurements. A t-test was performed to determine acoustic changes of the voice before and after eating each food. No significant difference was observed in the acoustic measurements. However, the subjects seemed to show some improvement in Jitter, HNR, and NNE in the order of egg, apple, and pear, even though the changes did not reach statistical significance. It was concluded that a more systematic research paradigm is needed to objectively reject or substantiate the various conceptions about food items and their effects on the voice.
-
The aim of this paper is to investigate the relationship between production and perception of Korean vowels by Koreans and Poles. The results of the experiments show that the relation is not linear and that factors other than those investigated here may influence perception and production. In most cases, the comparison of formant values (F1, F2) between Koreans and Poles was found to determine perception. However, in some cases certain vowels pronounced by Poles were not perceived as the intended ones, even though they showed no significant differences from those pronounced by Koreans and perceived as intended.
-
This study aims to define the isochronism of English feet. To assess the average value of moras in a foot, the study first sets up a way of counting the number of moras, based on extrametricality with some modifications. Secondly, by measuring the average duration of feet in Shakespeare's 120 sonnets with Praat (version 4.030, 2002), it clarifies foot isochronism in English. With these two ways of measuring isochronism, it shows that foot isochronism permits a range of 2.2 μ (moras) to 1.8 μ, that is, 22 μ to 18 μ per line, while the acoustic assessment shows an isochronically cognitive gap of 302-447 ms per foot, or 4,461 ms to 3,019 ms per line, in the case of iambic pentameter in English poetry.
-
This study investigates the phonetic and phonological characteristics of medieval Galician-Portuguese. It is necessary to consider the phonetic and phonological changes from Latin to Galician-Portuguese in order to understand the phonetic and phonological characteristics of contemporary Portuguese. The study examines palatalization, phonetic changes in consonant clusters, intervocalic consonant deletion, vowel diphthongization, and vowel nasalization, which were the major phonetic and phonological characteristics of medieval Galician-Portuguese.
-
This research investigated priming effects in Korean and English word production by Korean speakers. Picture naming with distractors was used as the experimental task. The type of target language, the type of distractor language, and the SOA (stimulus onset asynchrony) were used as variables. Cross-linguistic and within-linguistic priming effects were measured to investigate bilinguals' conceptual system.
-
English is a stress-timed language with a highly dynamic rhythm and stress and a tendency toward isochrony of stressed syllables. This goes together with various kinds of English utterance restructuring, irrespective of the pauses at syntactic boundaries, and with post-lexical phonological phenomena. In real speech acts in particular, the natural utterances of fluent speakers or broadcast speech show even more varied restructuring and phonological phenomena. This has been an obstacle for students in speaking fluent English and understanding normal speech. Therefore, this study focuses on the most problematic factor in English speaking and listening difficulty, namely English restructuring and post-lexical phonological phenomena caused by stress-timed rhythm, and then points out the importance of teaching English rhythm with this in mind.
-
Most existing telephone networks transmit narrowband speech, which is band-limited below 4 kHz. Compared with wideband speech extending up to 8 kHz, narrowband speech shows reduced intelligibility and a muffled quality. Bandwidth extension is a technique that generates wideband speech by reconstructing the 4-8 kHz highband without any additional information. This paper presents experimental results for bandwidth extension adopted for a 4800 bps CELP speech coder. In this experiment, we examine various methods for reconstructing the wideband spectrum and excitation signal, and compare and analyze their performance through a subjective preference test and cepstral distortion measurements.
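For reference, a hedged sketch of frame-level cepstral distortion in dB, in the form commonly used for bandwidth-extension evaluation; whether the 0th coefficient is included and how frames are averaged are assumptions here, not details taken from the paper.

# Cepstral distortion (dB) between reference and test cepstra.
import numpy as np

def cepstral_distortion(c_ref, c_test):
    """c_ref, c_test: arrays of shape (n_frames, n_cepstra), excluding c0."""
    diff = np.asarray(c_ref) - np.asarray(c_test)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return per_frame.mean()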
-
The development of noise-robust speech processing systems is becoming increasingly important as speech technology is widely applied in real-world applications. To address the noise problem, an adaptive noise canceller (ANC) based on adaptive filters is frequently used. Adaptive recursive filters perform better than adaptive non-recursive filters because of the added poles, but their stability may be severely threatened; this problem of adaptive recursive filters is solved by the ACHARF algorithm. This paper presents a method that combines a speaker verification system with an ANC using the ACHARF algorithm. In the front-end stage, the ANC is adopted to suppress the additive noise imposed on the speech signal. The results show that the performance of the speaker verification system is improved.
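To make the ANC front end concrete, here is an illustrative two-input adaptive noise canceller using plain LMS on a non-recursive filter; it only illustrates the cancelling idea and is not the ACHARF recursive algorithm used in the paper.

# LMS adaptive noise canceller sketch (not ACHARF).
import numpy as np

def lms_anc(primary, reference, n_taps=32, mu=0.01):
    """primary: speech + noise; reference: correlated noise-only channel."""
    w = np.zeros(n_taps)
    out = np.zeros(len(primary))
    for n in range(n_taps, len(primary)):
        x = reference[n - n_taps:n][::-1]     # reference tap vector
        noise_est = w @ x
        e = primary[n] - noise_est            # enhanced speech sample
        w += mu * e * x                       # LMS weight update
        out[n] = e
    return out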
-
Speech detection is one of the important problems in real-time speech recognition, since accurate detection of speech boundaries is crucial to recognizer performance. In this paper, we propose a speech detector based on Mel-band selection through training. To show the merit of the proposed algorithm, we compare it with a conventional one, EPD-VAA (endpoint detector based on voice activity detection). The proposed speech detector is trained to extract keyword speech better than other speech. EPD-VAA usually works well at high SNR but no longer works well at low SNR. The proposed algorithm, in contrast, pre-selects useful bands through keyword training and decides the speech boundary according to the energy level of the previously selected sub-bands. The experimental results show that the proposed algorithm outperforms EPD-VAA.
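A rough sketch of the band-selection idea: rank Mel bands by how well they separate keyword frames from other frames, then detect speech where the energy of the selected bands exceeds a threshold. The selection criterion and the fixed threshold below are assumptions, not the paper's exact training procedure.

# Hedged sketch: Mel-band selection and energy-based boundary decision.
import numpy as np

def select_bands(keyword_logmel, other_logmel, n_select=10):
    """Both inputs: (n_frames, n_mel_bands) log Mel energies."""
    gap = keyword_logmel.mean(axis=0) - other_logmel.mean(axis=0)
    spread = keyword_logmel.std(axis=0) + other_logmel.std(axis=0) + 1e-6
    return np.argsort(gap / spread)[::-1][:n_select]

def detect_speech(logmel, bands, threshold):
    band_energy = logmel[:, bands].mean(axis=1)
    frames = np.where(band_energy > threshold)[0]
    return (frames[0], frames[-1]) if frames.size else None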
-
In this paper, a speech enhancement system using a microphone array with MMSE-STSA (minimum mean square error short-time spectral amplitude) estimator-based post-processing is proposed. Speech enhancement is first carried out by conventional delay-and-sum beamforming (DSB). A new MMSE-STSA estimator is then obtained by refining the MMSE-STSA estimators from each microphone, and it is applied to the output of the conventional DSB to obtain additional enhancement. Computer simulations for white and pink noise show that the proposed system is superior to other approaches.
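A minimal sketch of the conventional delay-and-sum stage only, assuming the per-channel steering delays are already known in samples; the MMSE-STSA post-filter described in the paper is not reproduced here.

# Delay-and-sum beamforming over time-aligned microphone channels.
import numpy as np

def delay_and_sum(channels, delays_samples):
    """channels: (n_mics, n_samples); delays_samples: integer delay per mic."""
    n_mics, n_samples = channels.shape
    out = np.zeros(n_samples)
    for m in range(n_mics):
        out += np.roll(channels[m], -delays_samples[m])  # align, then average
    return out / n_mics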
-
Lip-reading technology is used in a bimodal framework to compensate for the degradation of speech recognition in noisy environments. The most important step in lip-reading is locating the correct lip area, but stable performance is hard to guarantee in dynamic environments. To compensate, a RASTA filter, which performs well at removing noise from speech, was used; applied as a time-domain digital filter, it improves performance. To observe the performance of speech recognition using image information only, 22 service-relevant words were chosen and a recognition experiment was conducted in a car. A hidden Markov model was used as the speech recognition algorithm to compare the recognition performance on these words.
-
In this paper, we propose phoneme duration modeling in a speech recognition system based on decision-tree state tying. We assume that phone duration has a Gamma distribution. In the training mode, we model the mean and variance of each state duration in context-independent phone models based on decision-tree state tying. In the recognition mode, we obtain the mean and variance of each context-dependent phone duration from the state duration information obtained during training. We make a comparative study of the proposed method with conventional methods, and our method shows good performance compared with them.
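A sketch under a stated assumption: per-state Gamma duration parameters are obtained by moment matching from state durations collected in training; how durations are gathered from the decision-tree tied states is not shown here.

# Gamma duration model: moment-matching fit and log-density score.
from math import lgamma, log
import numpy as np

def fit_gamma(durations_frames):
    """Return (shape k, scale theta) from mean/variance of observed durations."""
    d = np.asarray(durations_frames, dtype=float)
    mean, var = d.mean(), d.var()
    return mean**2 / var, var / mean

def log_duration_score(d, shape, scale):
    """Log Gamma density, usable as an additive duration penalty in decoding."""
    return ((shape - 1) * log(d) - d / scale
            - lgamma(shape) - shape * log(scale))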
-
As a preliminary study for improving the recognition performance of connected-digit telephone speech, we investigate feature parameters as well as channel compensation methods for telephone speech. CMN and RTCN are examined for telephone channel compensation, and MFCC, DWFBA, SSC and their delta features are examined as feature parameters. Recognition experiments with the database we collected show that, at the feature level, DWFBA is better than MFCC, and for channel compensation, RTCN is better than CMN. The DWFBA + Delta Mel-SSC feature shows the highest recognition rate.
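For orientation, a minimal sketch of utterance-level cepstral mean normalization (CMN), the channel-compensation baseline compared in the paper; the real-time variant (RTCN) is not reproduced here.

# Utterance-level CMN: subtract the per-utterance cepstral mean.
import numpy as np

def cmn(cepstra):
    """cepstra: (n_frames, n_coeffs). Removing the mean suppresses a
    stationary (channel) component in the log-spectral domain."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)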
-
Bimodal speech recognition based on lip reading has been studied as a representative method of speech recognition in noisy environments. There are three methods for integrating the speech and lip modalities: direct identification, separate identification, and dominant recoding. In this paper, we evaluate the robustness of lip-reading methods under the assumption that lip parameters are estimated with errors. Lip-reading experiments show that the dominant recoding approach is more robust than the other methods. Also, a measure of lip parameter degradation is proposed; this measure can be used to determine the weighting values for the video information.
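As a hedged sketch of how such a weighting value could be used, the function below fuses per-class audio and video log-likelihoods with a single stream weight; tying the weight to the proposed degradation measure is an illustrative assumption, not the paper's exact scheme.

# Stream-weighted fusion of audio and video log-likelihoods.
import numpy as np

def fuse_scores(audio_loglik, video_loglik, video_weight):
    """video_weight in [0, 1]; it can be lowered as the estimated
    lip-parameter degradation grows."""
    return ((1.0 - video_weight) * np.asarray(audio_loglik)
            + video_weight * np.asarray(video_loglik))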
-
This paper describes reducing the database of a corpus-based Korean speech synthesizer without degrading speech quality. First, it is proposed that the frequency of every unit in the reduced DB should reflect the frequency of units in Korean: the target population of every unit is set to be proportional to its frequency in a large Korean corpus (780K sentences, 45 mega-phonemes). Second, instances that are frequently selected during synthesis should also be kept in the reduced DB. Finally, it is proposed that the frequency of every instance should be reflected in the clustering criterion and used as the criterion for selecting representative instances. Evaluation shows that the proposed methods give better quality than conventional methods.
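A hedged sketch of the frequency-proportional target allocation; the clustering of instances and the choice of representatives are simplified placeholders (including the hypothetical "usage_count" field) for the criteria actually used in the paper.

# Frequency-proportional budget allocation and simple instance pruning.
from collections import Counter

def allocate_targets(unit_freq, total_budget):
    """unit_freq: Counter of unit -> frequency in the large corpus.
    Returns unit -> number of instances to keep in the reduced DB."""
    total = sum(unit_freq.values())
    return {u: max(1, round(total_budget * f / total))
            for u, f in unit_freq.items()}

def reduce_db(instances_by_unit, targets):
    """Keep the most frequently selected instances per unit, up to its target."""
    reduced = {}
    for unit, instances in instances_by_unit.items():
        ranked = sorted(instances, key=lambda i: i["usage_count"], reverse=True)
        reduced[unit] = ranked[:targets.get(unit, 1)]
    return reduced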