• Title/Summary/Keyword: speech event

Retrieval of Player Event in Golf Videos Using Spoken Content Analysis (음성정보 내용분석을 통한 골프 동영상에서의 선수별 이벤트 구간 검색)

  • Kim, Hyoung-Gook
    • The Journal of the Acoustical Society of Korea
    • /
    • v.28 no.7
    • /
    • pp.674-679
    • /
    • 2009
  • This paper proposes a method of player event retrieval that combines two functions: detection of player names in speech information and detection of sound events in the audio of golf videos. The system consists of an indexing module and a retrieval module. At indexing time, audio segmentation and noise reduction are applied to the audio stream demultiplexed from the golf videos. The noise-reduced speech is then fed into a speech recognizer, which outputs spoken descriptors. Player names and sound events are indexed by these spoken descriptors. At search time, a text query is converted into phoneme sequences, and the lists for each query term are retrieved through a description matcher to identify full and partial phrase hits. For player-name retrieval, this paper compares the results of word-based, phoneme-based, and hybrid approaches.
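
The matching step described in the abstract can be illustrated with a small sketch. The grapheme-to-phoneme table (TOY_G2P) and the overlap-based scoring rule below are hypothetical stand-ins for the paper's recognizer output and description matcher, which the abstract does not specify.

```python
# Minimal sketch of phoneme-based query matching against indexed spoken
# descriptors; lexicon and scoring rule are illustrative assumptions.

TOY_G2P = {"tiger": "T AY G ER", "woods": "W UH D Z"}  # hypothetical lexicon

def query_to_phonemes(query: str) -> list[str]:
    """Convert a text query into a flat phoneme sequence."""
    phones = []
    for word in query.lower().split():
        phones.extend(TOY_G2P.get(word, "").split())
    return phones

def match_score(query_phones: list[str], indexed_phones: list[str]) -> float:
    """Return 1.0 for a full phrase hit, else the longest partial-hit ratio."""
    n, m = len(query_phones), len(indexed_phones)
    best = 0
    for start in range(m):  # slide the query over the indexed sequence
        length = 0
        while (length < n and start + length < m
               and query_phones[length] == indexed_phones[start + length]):
            length += 1
        best = max(best, length)
    return best / n if n else 0.0

# Usage: a full phrase hit scores 1.0, a partial hit scores proportionally.
indexed = "T AY G ER W UH D Z P AA R".split()
print(match_score(query_to_phonemes("tiger woods"), indexed))  # 1.0
```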

An Approach to Chinese Conversations in the Textbook based on Social Units of Communication (중국어 회화문에 대한 의사소통 분석단위에 기초한 접근)

  • Park, Chan-Wook
    • Cross-Cultural Studies
    • /
    • v.49
    • /
    • pp.127-150
    • /
    • 2017
  • The objective of this study is to classify the conversations in Chinese textbooks into the four social units (speech community, speech situation, speech event, speech act) adopted from Dell Hymes (1972), and to suggest how the results can be applied to the curriculum of Chinese education. To this end, the study treats every conversation in the textbooks as a coordination of specific speech events and speech acts under specific situations. It introduces Hymes's (1972) concept of social units and elucidates their role in conversation, thereby reconsidering the conversations recorded in the textbooks not from a morphological or syntactic viewpoint but from a speech perspective. Finally, the study suggests how the results can be used effectively in Chinese conversation classes.

A review of speech perception: The first step for convergence on speech engineering (말소리지각에 대한 종설: 음성공학과의 융복합을 위한 첫 단계)

  • Lee, Young-lim
    • Journal of Digital Convergence
    • /
    • v.15 no.12
    • /
    • pp.509-516
    • /
    • 2017
  • People observe many events in their environment and perceive them without difficulty, including speech. As with the perception of biological motion, two main theoretical camps have debated speech perception. The purpose of this review article is to briefly describe speech perception and compare these two theories. Motor theorists claim that speech perception is special to humans because we both produce and perceive articulatory events, which are processed by innate neuromotor commands. Direct perception theorists, however, claim that speech perception is no different from nonspeech perception because we only need to detect information directly, as for all other kinds of events. Grasping how humans perceive articulatory events is fundamental for convergence with speech engineering, so this basic review of speech perception is expected to be useful for AI, voice recognition technology, speech recognition systems, etc.

Audio Event Classification Using Deep Neural Networks (깊은 신경망을 이용한 오디오 이벤트 분류)

  • Lim, Minkyu;Lee, Donghyun;Kim, Kwang-Ho;Kim, Ji-Hwan
    • Phonetics and Speech Sciences
    • /
    • v.7 no.4
    • /
    • pp.27-33
    • /
    • 2015
  • This paper proposes an audio event classification method using Deep Neural Networks (DNN). The proposed method applies a Feed-Forward Neural Network (FFNN) to generate, for each frame, event probabilities for ten audio events (dog barks, engine idling, and so on). For each frame, the mel-scale filter bank features of its consecutive frames are used as the input vector of the FFNN. These event probabilities are accumulated per event, and the classification result is the event with the highest accumulated probability. For the same dataset, the best previously reported accuracy was about 70%, obtained with a Support Vector Machine (SVM). The proposed method achieves 79.23% on the UrbanSound8K dataset when 80 mel-scale filter bank features from each of 7 consecutive frames (560 in total) are used as the input vector for an FFNN with two hidden layers and 2,000 neurons per hidden layer, with the rectified linear unit as the activation function.
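
The abstract specifies the network shape and decision rule in enough detail to sketch them. The following is a minimal numpy illustration, not the authors' implementation: the weights are random placeholders, and only the forward pass and the probability-accumulation decision rule are shown.

```python
# Per-frame event probabilities from an FFNN (560 inputs = 7 frames x 80
# mel filter bank features, two hidden layers of 2,000 ReLU units,
# 10 event classes), accumulated over the clip; weights are untrained.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((560, 2000)) * 0.01, np.zeros(2000)
W2, b2 = rng.standard_normal((2000, 2000)) * 0.01, np.zeros(2000)
W3, b3 = rng.standard_normal((2000, 10)) * 0.01, np.zeros(10)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def frame_probs(x):                      # x: (560,) stacked mel features
    h1 = np.maximum(0, x @ W1 + b1)      # ReLU hidden layer 1
    h2 = np.maximum(0, h1 @ W2 + b2)     # ReLU hidden layer 2
    return softmax(h2 @ W3 + b3)         # probabilities over 10 events

def classify(frames):                    # frames: (T, 560)
    acc = sum(frame_probs(f) for f in frames)  # accumulate per-frame probs
    return int(np.argmax(acc))           # event with highest accumulated mass

print(classify(rng.standard_normal((100, 560))))
```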

An Aerodynamic Study of Velopharyngeal Closure Function in Cleft Palate Patients (구개열 환자의 비인강폐쇄 기능에 대한 공기역학적 연구)

  • Ahn, Tae-Sub;Yang, Sang-Ill;Shin, Hyo-Keun
    • Speech Sciences
    • /
    • v.1
    • /
    • pp.237-259
    • /
    • 1997
  • Cleft palate speech exhibits hyper-/hyponasality due to velopharyngeal insufficiency, along with articulation disorders. Previous studies have shown that cleft palate speech tends to have lower airflow and air pressure. To examine its aerodynamic characteristics, the Aerophone II Voice Function Analyzer was used to measure sound pressure level, airflow, air pressure, and glottal power. Three adults with cleft palate and five normal adults participated in this experiment. The test words consisted of: (1) the sustained vowel /o/; (2) /CiCi/, where C is one of three different Korean stop consonants; and (3) /bimi/, which subjects were asked to produce five times without opening their lips. All data were tested statistically: t-tests compared the preoperative cleft palate group with the control group, and paired t-tests compared the cleft palate patients before and after operation. The results were as follows: (1) Cleft palate patients generally speak with incomplete oral closure and lower oral air pressure; as a result, the preoperative SPL of the cleft palate group is 3 dB lower than that of the control group. (2) Airflow of the cleft palate group in phonation and articulation is lower than that of the control group, but it increased after operation; lung volume and mean airflow in phonation increased significantly (p<0.05). (3) Although velopharyngeal function (velar opening rate) of the cleft palate group is poor compared with the control group, it recovered after operation; maximum flow rate and mean airflow rate increased significantly (p<0.05). (4) Air pressure of the cleft palate group in speech is lower than that of the control group and generally increased after operation; the air pressure of glottalized consonants increased significantly (p<0.04). (5) Glottal power (mean power, mean efficiency, and mean resistance) of the cleft palate patients is lower than that of the control group, but mean efficiency and mean resistance increased significantly (p<0.05) after operation.
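
For readers unfamiliar with the two test designs mentioned above, here is a minimal scipy illustration; the airflow values are invented placeholders, not the paper's measurements.

```python
# Hedged illustration of the two statistical comparisons described above;
# all numbers are hypothetical, not the study's data.
from scipy import stats

# hypothetical mean airflow (L/s) for cleft palate patients pre/post operation
pre_op   = [0.08, 0.10, 0.07]
post_op  = [0.12, 0.15, 0.11]
controls = [0.14, 0.16, 0.13, 0.15, 0.17]

# unpaired t-test: preoperative patients vs. controls
t_unpaired, p_unpaired = stats.ttest_ind(pre_op, controls)
# paired t-test: the same patients before vs. after operation
t_paired, p_paired = stats.ttest_rel(pre_op, post_op)
print(f"pre vs. controls: p={p_unpaired:.3f}; pre vs. post: p={p_paired:.3f}")
```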

A review of event perception: The first step for convergence on robotics (사건지각에 대한 종설: 로봇공학과의 융복합을 위한 첫단계)

  • Lee, Young-Lim
    • Journal of Digital Convergence
    • /
    • v.13 no.4
    • /
    • pp.357-368
    • /
    • 2015
  • People observe many events in their environment and can easily recognize the nature of an event from the resulting optic flow. The questions are how people recognize events and what information in the optic flow enables observers to do so. Motor theorists claim that human observers exhibit special sensitivity when perceiving events like speech or biological motion because we both produce and perceive those events. Direct perception theorists, however, suggest that the perception of speech or biological motion is no different from the perception of all other kinds of events. The purpose of this review article is to address this controversy, critique the motor theory, and describe a direct realist approach to event perception. Understanding how humans perceive events is fundamental for convergence with robotics.

Noise Robust Baseball Event Detection with Multimodal Information (멀티모달 정보를 이용한 잡음에 강인한 야구 이벤트 시점 검출 방법)

  • Young-Ik Kim;Hyun Jo Jung;Minsoo Na;Younghyun Lee;Joonsoo Lee
    • Proceedings of the Korean Society of Broadcast Engineers Conference
    • /
    • 2022.11a
    • /
    • pp.136-138
    • /
    • 2022
  • Efficiently detecting the time points of specific events in sports broadcast/media data is an important technology for information retrieval, highlight generation, and summarization. This paper proposes a method that fuses audio and video information to robustly detect the time points of bat-contact and catch events for each pitch in baseball broadcast data. Event detection based on audio information is computationally cheap and accurate, but many cases arise in which the ambiguity cannot be resolved without the help of video information. In baseball broadcasts in particular, video information about the pitcher's release point can be used to further improve the accuracy of hit and catch event detection. This paper proposes an audio-based deep learning model for event time detection together with a video-based correction method, and describes their application to real KBO baseball broadcast data along with experimental results.
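
The abstract does not give the fusion rule, but the idea of correcting audio detections with video-derived pitch timing can be sketched roughly as below; the candidate format, window bounds, and confidence threshold are all illustrative assumptions, not the paper's parameters.

```python
# A rough sketch of the fusion idea: audio detection yields (time,
# confidence) candidates, video analysis yields the pitch release time,
# and only audio events in a plausible post-release window are kept.

def fuse_events(audio_candidates, release_time,
                min_delay=0.3, max_delay=1.2, conf_thresh=0.5):
    """Keep audio events that fall in a plausible window after release."""
    return [
        (t, c) for (t, c) in audio_candidates
        if conf_thresh <= c and min_delay <= t - release_time <= max_delay
    ]

# Usage: ambiguous audio hits outside the pitch window are discarded.
audio = [(10.2, 0.9), (10.9, 0.7), (14.5, 0.8)]  # (seconds, confidence)
print(fuse_events(audio, release_time=10.4))      # -> [(10.9, 0.7)]
```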

Combining multi-task autoencoder with Wasserstein generative adversarial networks for improving speech recognition performance (음성인식 성능 개선을 위한 다중작업 오토인코더와 와설스타인식 생성적 적대 신경망의 결합)

  • Kao, Chao Yuan;Ko, Hanseok
    • The Journal of the Acoustical Society of Korea
    • /
    • v.38 no.6
    • /
    • pp.670-677
    • /
    • 2019
  • As the presence of background noise in an acoustic signal degrades the performance of speech or acoustic event recognition, extracting noise-robust acoustic features from a noisy signal remains challenging. In this paper, we propose a combined structure of a Wasserstein Generative Adversarial Network (WGAN) and a MultiTask AutoEncoder (MTAE) as a deep learning architecture that integrates the respective strengths of the MTAE and the WGAN, so that it estimates not only the noise but also the speech features from a noisy acoustic source. The proposed MTAE-WGAN structure estimates the speech signal and the residual noise by employing a gradient penalty and a weight initialization method for the Leaky Rectified Linear Unit (LReLU) and the Parametric ReLU (PReLU). With the adopted gradient penalty loss function, the proposed MTAE-WGAN structure enhances the speech features and subsequently achieves substantial Phoneme Error Rate (PER) improvements over the stand-alone Deep Denoising Autoencoder (DDAE), MTAE, Redundant Convolutional Encoder-Decoder (R-CED), and Recurrent MTAE (RMTAE) models for robust speech recognition.
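
The gradient penalty mentioned above is the standard WGAN-GP term. A minimal PyTorch sketch follows, with a tiny placeholder critic rather than the paper's MTAE-WGAN architecture; the feature dimension and batch size are arbitrary.

```python
# WGAN gradient penalty: penalize critic gradient norms away from 1
# on points interpolated between real and generated samples.
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    eps = torch.rand(real.size(0), 1)             # per-sample mixing weight
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = critic(x_hat)
    grads, = torch.autograd.grad(score.sum(), x_hat, create_graph=True)
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

critic = torch.nn.Sequential(torch.nn.Linear(80, 64),
                             torch.nn.LeakyReLU(0.2),  # LReLU as in the paper
                             torch.nn.Linear(64, 1))
real, fake = torch.randn(8, 80), torch.randn(8, 80)    # placeholder features
loss = (-critic(real).mean() + critic(fake).mean()
        + gradient_penalty(critic, real, fake))
loss.backward()
print(loss.item())
```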

Acoustic Monitoring and Localization for Social Care

  • Goetze, Stefan;Schroder, Jens;Gerlach, Stephan;Hollosi, Danilo;Appell, Jens-E.;Wallhoff, Frank
    • Journal of Computing Science and Engineering
    • /
    • v.6 no.1
    • /
    • pp.40-50
    • /
    • 2012
  • The increase in the number of older people due to demographic changes poses great challenges to social healthcare systems in both Western and Eastern countries. Supporting older people with formal caregivers requires enormous temporal and personnel effort, so one of the most important goals is to increase the efficiency and effectiveness of today's care. This can be achieved through assistive technologies, which can increase the safety of patients and reduce the time spent on tasks that do not involve direct interaction between the caregiver and the patient. Motivated by this goal, this contribution focuses on applications of acoustic technologies to support users and caregivers in ambient assisted living (AAL) scenarios. Acoustic sensors are small and unobtrusive and can easily be added to existing care or living environments. The information gathered by the acoustic sensors can be analyzed to calculate the user's position by localization and the context by detection and classification of acoustic events in the captured signal. In this way, potentially dangerous situations like falls, screams, or an increased number of coughs can be detected, and appropriate actions can be initiated by an intelligent autonomous system for the acoustic monitoring of older persons. The proposed system reduces the false alarm rate compared with other existing, commercially available approaches that rely basically only on the acoustic level, because it explicitly distinguishes between the various acoustic events and provides information on the type of emergency that has taken place. Furthermore, the system can determine the position of the acoustic event as contextual information using only the acoustic signal, so the user's position is known even if he or she does not wear a localization device such as a radio-frequency identification (RFID) tag.
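
One common way to obtain the position information described above is to estimate the time difference of arrival (TDOA) between microphones. The sketch below uses plain cross-correlation on synthetic signals and illustrates only the general idea, not the system's actual localization method.

```python
# Estimate the inter-microphone delay of an acoustic event by
# cross-correlation; signals and geometry here are synthetic.
import numpy as np

fs = 16000                       # sample rate (Hz)
delay_samples = 12               # true delay between the two microphones
event = np.random.randn(2048)    # synthetic acoustic event
mic1 = event
mic2 = np.concatenate([np.zeros(delay_samples), event[:-delay_samples]])

# cross-correlate and pick the lag with maximum correlation
corr = np.correlate(mic2, mic1, mode="full")
lag = np.argmax(corr) - (len(mic1) - 1)
tdoa = lag / fs                  # convert samples to seconds
print(f"estimated delay: {lag} samples ({tdoa * 1e3:.2f} ms)")
```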

Enhancement of Processing Capabilities of Hippocampus Lobe: A P300 Based Event Related Potential Study

  • Benet, Neelesh;Krishna, Rajalakshmi;Kumar, Vijay
    • Korean Journal of Audiology
    • /
    • v.25 no.3
    • /
    • pp.119-123
    • /
    • 2021
  • Background and Objectives: The influence of music training on different areas of the brain has been extensively researched, but the underlying neurobehavioral mechanisms remain unknown. In the present study, the effects of more than three years of training in Carnatic music (an Indian form of music) on the discrimination ability of different areas of the brain were tested using P300 analysis at three electrode placement sites. Subjects and Methods: A total of 27 individuals, 13 singers aged 16-30 years (mean±standard deviation, 23±3.2 years) and 14 non-singers aged 16-30 years (mean age, 24±2.9 years), participated in this study. The singers had 3-5 years of formal training in Carnatic music. Cortical activities in areas corresponding to attention, discrimination, and memory were tested using P300 analysis with the Intelligent Hearing System. Results: The mean P300 amplitude of the singers at the Fz electrode placement site (5.64±1.81) was significantly higher than that of the non-singers (3.85±1.60; t(25)=3.3, p<0.05). The amplitude at the Cz site in singers (5.90±2.18) was significantly higher than in non-singers (3.46±1.40; t(25)=3.3, p<0.05), and the amplitude at the Pz site in singers (4.94±1.89) was significantly higher than in non-singers (3.57±1.50; t(25)=3.3, p<0.05). Among singers, the mean P300 amplitude was significantly higher at the Cz site than at the other placement sites, whereas among non-singers it was significantly higher at the Fz site; that is, music training facilitated enhancement of the P300 amplitude at the Cz site. Conclusions: These findings suggest that more than three years of training in Carnatic singing can enhance neural coding to discriminate subtle differences, leading to enhanced discrimination abilities of the brain, mainly at the generation site corresponding to the Cz electrode placement.