Search | Korea Science

A Study of Data Augmentation and Auto Speech Recognition for the Elderly (한국어 노인 음성 데이터 증강 및 인식 연구 )

Keon Hee Kim;Seoyoon Park;Hansaem Kim
- Annual Conference on Human and Language Technology / 2023.10a / pp.56-60 / 2023
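Conventional speech recognition has focused on young and middle-aged adults, but as population aging accelerates, the need for research on elderly speech is growing. However, elderly speech datasets are still far less available than datasets of young and middle-aged adult speech. To help address this shortage, this study investigates a method for augmenting scarce elderly speech data. We analyzed the features of elderly speech and augmented the data by synthesizing the 'frequency' and 'speaking rate' features onto general adult speech. We then fine-tuned the Whisper small model and measured the character error rate (CER) on elderly speech, finding that using the augmented dataset together with the existing elderly dataset was the most effective.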
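The abstract above reports results as CER (character error rate). As a minimal sketch of how that metric is commonly computed, and not the authors' evaluation code, the example below divides the character-level Levenshtein distance by the reference length. The function names and the choice to ignore spaces before scoring are assumptions made for illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length.

    Assumption: spaces are removed before scoring. Some evaluations keep them.
    """
    ref = reference.replace(" ", "")
    hyp = hypothesis.replace(" ", "")
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


# Usage: compare a recognizer output against its reference transcript.
score = cer("kitten", "sitting")
```

Because each Hangul syllable is a single Unicode character, the same function scores Korean transcripts at the syllable level without modification.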

End-to-end speech recognition models using limited training data (제한된 학습 데이터를 사용하는 End-to-End 음성 인식 모델)

Kim, June-Woo;Jung, Ho-Young
- Phonetics and Speech Sciences / v.12 no.4 / pp.63-71 / 2020
Speech recognition is one of the areas most actively commercialized using deep learning and machine learning techniques. However, most speech recognition systems on the market are developed on data with limited speaker diversity and tend to perform well only on typical adult speakers, because most models are trained on speech databases collected from adult men and women. As a result, they often have trouble recognizing the speech of the elderly, children, and speakers of dialects. Solving these problems may require maintaining a large database or collecting data for speaker adaptation. Instead, this paper proposes a new end-to-end speech recognition method consisting of an acoustic augmented recurrent encoder and a transformer decoder with linguistic prediction. The proposed method can deliver reliable acoustic and language model performance under limited-data conditions. It was evaluated on Korean elderly and children's speech with a limited amount of training data and performed better than a conventional method.
https://doi.org/10.13064/KSSS.2020.12.4.063

Conformer-based Elderly Speech Recognition using Feature Fusion Module (피쳐 퓨전 모듈을 이용한 콘포머 기반의 노인 음성 인식)

Minsik Lee;Jihie Kim
- Annual Conference on Human and Language Technology / 2023.10a / pp.39-43 / 2023
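Automatic speech recognition (ASR) is a technology by which a computer converts human speech into text. ASR systems are used in a wide range of applications, including voice command and control, voice search, text transcription, and automatic speech translation. Despite progress in ASR, the difficulty of elderly speech recognition (ESR) has not diminished. This study proposes an elderly speech recognition model based on Conformer and a Features Fusion Module (FFM). Training and evaluation use the VOTE400 (Voice Of The Elderly 400 Hours) dataset. The study is significant in that it presents a deep learning model for elderly speech recognition using Conformer and fused features, a combination that has rarely been explored. The model also achieves higher accuracy than the Conformer model alone, contributing to research on deep learning models for elderly speech recognition.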

Deep learning-based speech recognition for Korean elderly speech data including dementia patients (치매 환자를 포함한 한국 노인 음성 데이터 딥러닝 기반 음성인식)

Jeonghyeon Mun;Joonseo Kang;Kiwoong Kim;Jongbin Bae;Hyeonjun Lee;Changwon Lim
- The Korean Journal of Applied Statistics / v.36 no.1 / pp.33-48 / 2023
In this paper we consider automatic speech recognition (ASR) for Korean speech data in which elderly persons say a random sequence of words, such as names of animals and vegetables, for one minute. Most of the speakers are over 60 years old and some of them are dementia patients. The goal is to compare deep learning-based ASR models on such data and to find models that perform well. ASR is a technology by which computers recognize spoken words and convert them into written text. Recently, many well-performing deep learning models have been developed for ASR. Training data for such models consist mostly of sentences, and the speakers are generally expected to pronounce words accurately. In our data, however, most of the speakers are over the age of 60 and often pronounce words incorrectly, and they say a random series of words rather than sentences. Therefore, models pre-trained on typical training data may not be suitable for our data, and hence we train deep learning-based ASR models from scratch on our data. We also apply several data augmentation methods because the dataset is small.
https://doi.org/10.5351/KJAS.2023.36.1.033

Trends of Spontaneous Speech Dialogue Processing Technology (자유발화형 음성대화처리 기술동향)

Kwon, O.W.;Choi, S.K.;Roh, Y.H.;Kim, Y.K.;Park, J.G.;Lee, Y.K.
- Electronics and Telecommunications Trends / v.30 no.4 / pp.26-35 / 2015
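With the mobile revolution and the arrival of the era of big data and the Internet of Things, controlling and using a variety of devices and services through human voice and speech is coming to be taken for granted. Spoken dialogue processing technology will evolve toward recognizing, understanding, and processing free, human-centered utterances. This article reviews current domestic and international technology and industry trends and intellectual property trends in spoken dialogue processing, and describes the concept and development direction of human-centered spontaneous spoken dialogue processing technology.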

The Structure of Solving VoIP Firewall/NAT Traversal Problem (VoIP Firewall/NAT Traversal 문제 해결을 위한 구조)

Choi, Kyoung-Ho;Kang, Boo-Joong;Ro, In-Woo;Im, Eul-Gyu
- Proceedings of the Korean Information Science Society Conference / 2007.06d / pp.229-233 / 2007
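VoIP (Voice over Internet Protocol) is a technology that delivers voice data as IP datagrams over the existing Internet. One advantage of VoIP is that carrying voice data over the existing Internet greatly reduces the line costs of conventional voice telephone service. However, the characteristics of the protocols used by VoIP make it difficult to apply VoIP to the existing Internet as-is. Firewalls and NAT (Network Address Translator) devices in the existing Internet are essential for security, but from the standpoint of VoIP they interfere with the smooth transfer of voice data. The problem arises because the way the VoIP signaling protocols H.323 and SIP set up connections and transfer data conflicts with the functions of firewalls and NAT devices. This problem must therefore be solved for VoIP communication to work smoothly over the existing Internet unchanged. This paper surveys techniques previously studied for the Firewall/NAT traversal problem and proposes a new architecture.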

BackTranScription (BTS)-based Jeju Automatic Speech Recognition Post-processor Research (BackTranScription (BTS)기반 제주어 음성인식 후처리기 연구)

Park, Chanjun;Seo, Jaehyung;Lee, Seolhwa;Moon, Heonseok;Eo, Sugyeong;Jang, Yoonna;Lim, Heuiseok
- Annual Conference on Human and Language Technology / 2021.10a / pp.178-185 / 2021
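Training a sequence-to-sequence (S2S) speech recognition post-processor requires a parallel corpus of (speech recognition sentence, human post-edited sentence corrected by a phonetic transcriptor) pairs, and building it takes a great deal of human labor. BackTranScription (BTS) is a data construction methodology proposed to ease this limitation of existing S2S-based post-processors: it combines Text-To-Speech (TTS) and Speech-To-Text (STT) technology to generate a pseudo-parallel corpus. Because the method removes the role of the transcriptor and can automatically generate large amounts of training data, it reduces the time and cost of data construction. To improve a BTS-based speech recognition post-processor specialized for the Jeju language domain, this paper compares a model-centric approach, which improves performance through model modification, with a data-centric approach, which improves performance by considering the quantity and quality of data without modifying the model. The experiments showed that applying the data-centric approach without model modification was more helpful for improving performance, and the paper analyzes the negative results of the model-centric approach.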
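The BTS idea described above, passing clean text through TTS and then STT to obtain (recognized, original) pairs, can be sketched as follows. This is an illustration only, not the authors' pipeline: `build_pseudo_parallel_corpus`, `toy_tts`, and `toy_stt` are hypothetical names, and the toy functions merely stand in for real TTS and STT systems, which the abstract does not specify.

```python
from typing import Callable, Iterable, List, Tuple


def build_pseudo_parallel_corpus(
    sentences: Iterable[str],
    tts: Callable[[str], bytes],
    stt: Callable[[bytes], str],
) -> List[Tuple[str, str]]:
    """Return (recognized sentence, original sentence) pairs.

    The original text plays the role of the human post-edited target,
    so no transcriptor is needed to build the pairs.
    """
    corpus = []
    for gold in sentences:
        audio = tts(gold)        # synthesize speech from the clean text
        recognized = stt(audio)  # transcribe it back, recognition errors included
        corpus.append((recognized, gold))
    return corpus


# Hypothetical stand-ins so the sketch runs without any speech system.
def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


def toy_stt(audio: bytes) -> str:
    # Simulate a recognizer that loses word spacing.
    return audio.decode("utf-8").replace(" ", "")


pairs = build_pseudo_parallel_corpus(["hello world"], toy_tts, toy_stt)
```

Taking the TTS and STT steps as plain callables keeps the corpus-building loop independent of any particular speech toolkit.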

A Study on the Pilot Application of Disaster Information Delivery and Evacuation Support System for the Vulnerable Groups (안전취약계층 대상 재난정보 전달 및 대피지원 체계 시범적용 연구)

Jung Tae-Ho;Lee, Han-Jun
- Proceedings of the Korean Society of Disaster Information Conference / 2022.10a / pp.139-140 / 2022
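This study concerns the pilot application of a system that helps people with disabilities and the elderly, who are among the safety-vulnerable groups with markedly reduced ability to assess and respond to on-site conditions during a disaster, receive disaster information and evacuate and respond safely. The disaster information delivery and evacuation support system was designed so that people with disabilities and the elderly can respond to a disaster crisis in a way that takes their respective vulnerabilities into account. It was applied on a pilot basis by establishing spatial information standards and data for indoor positioning and by selecting an indoor space where the developed system could be installed and implemented. For the pilot, indoor evacuation routes were built by constructing and updating indoor spatial information for the selected facility, and were then refined through the actual pilot application. To build the service for people with disabilities and the elderly, we developed a linkage module that collects disaster information data in real time when a disaster occurs indoors and links it to smartphones. We also developed an alarm push module that sends disaster information to smartphones, a disaster information and evacuation guidance module, and a voice guidance module that helps visually impaired people perceive indoor space. The purpose of this study is to eliminate blind spots in information delivery by providing a service that uses IoT-based integrated control technology, to build a customized disaster information delivery and evacuation support service for efficient disaster response by people with disabilities and the elderly, and, by correcting problems through the pilot application, ultimately to improve the safety of safety-vulnerable groups in disasters.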

Comparison of Classification Performance Between Adult and Elderly Using Acoustic and Linguistic Features from Spontaneous Speech (자유대화의 음향적 특징 및 언어적 특징 기반의 성인과 노인 분류 성능 비교)

SeungHoon Han;Byung Ok Kang;Sunghee Dong
- KIPS Transactions on Software and Data Engineering / v.12 no.8 / pp.365-370 / 2023
This paper aims to compare the performance of classifying speech data into two groups, adult and elderly, based on the acoustic and linguistic characteristics that change with aging, such as respiratory patterns, phonation, pitch, frequency, and language expression ability. For acoustic features, we used attributes related to the frequency, amplitude, and spectrum of the speech signal. For linguistic features, we extracted hidden state vector representations containing contextual information from the transcriptions of speech utterances using KoBERT, a Korean pre-trained language model that has shown excellent performance in natural language processing tasks. The classification performance of each model trained on acoustic or linguistic features was evaluated, and the F1 scores of each model for the two classes, adult and elderly, were examined after addressing the class imbalance problem by down-sampling. The experimental results showed that linguistic features classified adult and elderly speakers better than acoustic features, and that even when the class proportions were equal, classification performance for the adult class was higher than for the elderly class.
https://doi.org/10.3745/KTSDE.2023.12.8.365
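The evaluation described in this entry, down-sampling to equalize the two classes and then reading a per-class F1 score, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the fixed random seed, and the toy labels are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def downsample(samples: Sequence, labels: Sequence[Hashable],
               seed: int = 0) -> List[Tuple[object, Hashable]]:
    """Randomly drop samples so every class keeps as many as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    n = min(len(items) for items in by_class.values())
    balanced = [(item, label)
                for label, items in by_class.items()
                for item in rng.sample(items, n)]
    rng.shuffle(balanced)
    return balanced


def f1_for_class(y_true: Sequence, y_pred: Sequence, positive) -> float:
    """F1 score treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: five adult utterances and two elderly utterances.
balanced = downsample(list(range(7)), ["adult"] * 5 + ["elderly"] * 2)
```

Reporting F1 separately for each class, as the abstract does, makes it visible when one class remains harder to classify even after the proportions are equalized.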

Age classification of emergency callers based on behavioral speech utterance characteristics (발화행태 특징을 활용한 응급상황 신고자 연령분류)

Son, Guiyoung;Kwon, Soonil;Baik, Sungwook
- The Journal of Korean Institute of Next Generation Computing / v.13 no.6 / pp.96-105 / 2017
In this paper, we investigate speaker age classification by analyzing voice calls to an emergency center. We classified callers as adult or elderly using behavioral speech utterance features and an SVM (Support Vector Machine), a machine learning classifier. Through analysis of the emergency center call data, we selected two behavioral speech utterance features: silent pause and turn-taking latency. The criteria for age classification were first selected through analysis of these features in the emergency call center data and were then shown to be statistically significant (p < 0.05). We analyzed 200 samples (adult: 100, elderly: 100) with 5-fold cross-validation using the SVM classifier. As a result, we achieved 70% accuracy using the two features, which is higher than the accuracy obtained with a single feature. These results suggest behavioral speech utterance features as a new method for age classification; in future work, classification will combine acoustic information (MFCC) with the new behavioral features on real voice data. This work will also contribute to the development of emergency situation judgment systems related to age classification.
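The 5-fold cross-validation over 200 balanced samples mentioned above can be illustrated with a small index splitter. This sketch covers only the fold construction, with the classifier left out; whether the authors stratified their folds is not stated, so the stratification, the function name, and the seed are assumptions.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_k_fold(labels: Sequence[Hashable], k: int = 5,
                      seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) splits that preserve class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's shuffled indices round-robin into the k folds.
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    splits = []
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, test))
    return splits


# Same class sizes as the abstract: 100 adult and 100 elderly samples.
labels = ["adult"] * 100 + ["elderly"] * 100
splits = stratified_k_fold(labels, k=5)
```

Each split holds out 40 samples for testing and trains on the remaining 160, and every sample is tested exactly once across the five folds.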
