• Title/Abstract/Keywords: Speech Data

A Study on the Impact of Speech Data Quality on Speech Recognition Models

  • Yeong-Jin Kim;Hyun-Jong Cha;Ah Reum Kang
    • Journal of the Korea Society of Computer and Information / Vol. 29, No. 1 / pp. 41-49 / 2024
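  • Speech recognition technology continues to advance and is widely used in many fields. To examine how speech data quality affects speech recognition models, this study divided the data into the full dataset and a dataset of the top 70% of utterances by SNR, transcribed both with Seamless M4T and Google Cloud Speech-to-Text, and evaluated the transcriptions using the Levenshtein distance (a lower distance means fewer edits relative to the reference). For Seamless M4T, the model using high-SNR (signal-to-noise ratio) data scored 13.6, lower than the 16.6 scored on the full dataset. Google Cloud Speech-to-Text, by contrast, scored 8.3 on the full dataset, lower than on the high-SNR data. This suggests that using high-SNR data has an effect when training a new speech recognition model, and that the Levenshtein distance algorithm can serve as one metric for evaluating speech recognition models.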
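The two processing steps in this abstract, keeping the top 70% of utterances by SNR and scoring transcripts with the Levenshtein distance, can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' code: SNR values are assumed to be precomputed per utterance (the hypothetical `snr_db` key), the 70% cut is rounded to the nearest whole utterance, and the score is assumed to be the mean per-utterance character-level distance, since the abstract does not say how the reported scores were aggregated.

```python
from typing import Dict, List, Sequence


def levenshtein(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of insertions, deletions and substitutions turning hyp into ref."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # ref symbol r has no counterpart in hyp
                curr[j - 1] + 1,         # hyp symbol h has no counterpart in ref
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1]


def top_fraction_by_snr(utterances: List[Dict], fraction: float = 0.7) -> List[Dict]:
    """Keep the highest-SNR share of utterances.

    Assumption: each utterance is a dict with a precomputed 'snr_db' value
    (hypothetical field name); the cut is rounded to a whole utterance.
    """
    ranked = sorted(utterances, key=lambda u: u["snr_db"], reverse=True)
    return ranked[: round(len(ranked) * fraction)]


def mean_distance(pairs) -> float:
    """Assumed aggregate: average per-utterance distance over (reference, transcript) pairs."""
    return sum(levenshtein(ref, hyp) for ref, hyp in pairs) / len(pairs)
```

Comparing `mean_distance` on the full set with the same function on the output of `top_fraction_by_snr` gives the shape of the comparison described, with lower values meaning transcripts closer to the reference.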

Speech Watermark Based on Patchwork for Digital Broadcasting

  • 여인권;김형중;최용희;김기섭
    • Journal of Broadcast Engineering / Vol. 5, No. 2 / pp. 220-226 / 2000
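  • This paper presents a method for embedding a watermark in broadcast speech. Digital broadcasting does not deliberately distinguish between audio and speech. In educational broadcasting, however, speech is far more important than video or audio and makes up a larger share of the content. One of the key issues in digital broadcasting is protection against illegal copying. This paper describes the performance and limitations of an audio watermark adapted for speech, presents results on its robustness to attacks, and outlines open problems in speech watermarking research.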
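The abstract does not spell out the patchwork scheme, so the sketch below shows only the classic patchwork idea, applied directly to PCM samples as a simplifying assumption: a secret key selects two disjoint sets of samples, set A is raised by `delta` and set B lowered by `delta`, and the detector checks whether mean(A) minus mean(B) lies near `2*delta` rather than near zero. The function names, key handling, and time-domain embedding are illustrative; the paper's speech-adapted variant may operate in a transform domain.

```python
import random
from typing import List, Tuple

# Assumption: time-domain patchwork on raw samples, for illustration only.


def _patches(n: int, key: int, pairs: int) -> Tuple[List[int], List[int]]:
    """Use the secret key to draw two disjoint index sets A and B of size `pairs`."""
    rng = random.Random(key)
    idx = rng.sample(range(n), 2 * pairs)
    return idx[:pairs], idx[pairs:]


def embed(samples: List[float], key: int, pairs: int, delta: float) -> List[float]:
    """Raise every sample in A by delta and lower every sample in B by delta."""
    a, b = _patches(len(samples), key, pairs)
    marked = list(samples)
    for i in a:
        marked[i] += delta
    for i in b:
        marked[i] -= delta
    return marked


def detect(samples: List[float], key: int, pairs: int) -> float:
    """mean(A) - mean(B): close to 2*delta for marked audio, close to 0 otherwise."""
    a, b = _patches(len(samples), key, pairs)
    return (sum(samples[i] for i in a) - sum(samples[i] for i in b)) / pairs
```

Because the detector averages over many sample pairs, the random signal content largely cancels while the embedded offset accumulates, which is why a small `delta` can still be detected with the correct key.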

Improved speech emotion recognition using histogram equalization and data augmentation techniques

  • 허운행;권오욱
    • Phonetics and Speech Sciences / Vol. 9, No. 2 / pp. 77-83 / 2017
  • We propose a new method to reduce emotion recognition errors caused by variation in speaker characteristics and speech rate. First, to reduce variation in speaker characteristics, we adjust the features of a test speaker to fit the distribution of all training data using the histogram equalization (HE) algorithm. Second, to deal with variation in speech rate, we augment the training data with speech generated at various speech rates. In computer experiments using EMO-DB, KRN-DB and eNTERFACE-DB, the proposed method improves weighted accuracy by 34.7%, 23.7% and 28.1% relative, respectively.
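Histogram equalization of features, as described above, maps each test-speaker feature value to the training value at the same empirical cumulative-distribution level. A minimal sketch for one feature dimension follows (it would be applied to each dimension separately); the rank-based CDF and the choice of the upper quantile are assumptions, not details taken from the paper.

```python
import bisect
from typing import List, Sequence


def histogram_equalize(test: Sequence[float], train: Sequence[float]) -> List[float]:
    """Map each test value to the training value at the same empirical CDF level.

    Assumption: rank-based empirical CDFs; one feature dimension at a time.
    """
    test_sorted = sorted(test)
    train_sorted = sorted(train)
    n_test, n_train = len(test_sorted), len(train_sorted)
    out = []
    for x in test:
        # Rank of x within the test speaker's own data: CDF level = rank / n_test
        rank = bisect.bisect_right(test_sorted, x)
        # Inverse training CDF at that level, ceil(rank * n_train / n_test) - 1,
        # computed with integers to avoid floating-point off-by-one errors
        k = (rank * n_train + n_test - 1) // n_test - 1
        out.append(train_sorted[k])
    return out
```

The output preserves the order of the test frames, so it can replace the original feature track directly before classification.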

Implementation of speech interface for Windows 95

  • 한영원;배건성
    • Journal of the Korean Institute of Telematics and Electronics S / Vol. 34S, No. 5 / pp. 86-93 / 1997
  • With recent developments in speech recognition technology and multimedia computer systems, more potential applications of voice will become a reality. In this paper, we implement a speech interface in the Windows 95 environment for practical voice use on multimedia computers. The speech interface consists of three modules: a speech input and detection module, a speech recognition module, and an application module. The speech input and detection module uses the low-level audio services of the Win32 API to capture speech data in real time. The recognition module processes the incoming speech data and recognizes the spoken command, using the DTW (dynamic time warping) pattern matching method. The application module executes the recognized voice command on the PC. Each module of the speech interface is designed and tested in the Windows 95 environment. The implemented speech interface and experimental results are explained and discussed.
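The abstract names DTW pattern matching but not its local path constraints or frame distance. The sketch below assumes the basic symmetric step pattern (horizontal, vertical and diagonal moves), Euclidean distance between feature frames, and no length normalization, and picks the command whose stored template aligns with the lowest cost. `recognize` and the template dictionary are hypothetical names.

```python
import math
from typing import Dict, List, Sequence

Frame = Sequence[float]


def dtw_distance(x: Sequence[Frame], y: Sequence[Frame]) -> float:
    """Cumulative Euclidean cost of the best monotonic alignment of x and y."""
    n, m = len(x), len(y)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(x[i - 1], y[j - 1])
            # Assumption: symmetric step pattern without slope constraints
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def recognize(utterance: Sequence[Frame], templates: Dict[str, List[Frame]]) -> str:
    """Return the command whose reference template has the smallest DTW distance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))
```

Because the warping path can repeat frames, an utterance spoken at a different speed than its template still aligns at low cost, which is what makes DTW suitable for isolated-command recognition.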

Speech Quality of a Sinusoidal Model Depending on the Number of Sinusoids

  • Seo, Jeong-Wook;Kim, Ki-Hong;Seok, Jong-Won;Bae, Keun-Sung
    • Speech Sciences / Vol. 7, No. 1 / pp. 17-29 / 2000
  • STC (Sinusoidal Transform Coding) is a vocoding technique that uses a sinusoidal speech model to obtain high-quality speech at a low data rate. It models and synthesizes the speech signal using the fundamental frequency and its harmonics in the frequency domain. To reduce the data rate, the sinusoidal amplitudes and phases must be represented with as few peaks as possible while maintaining speech quality. As basic research toward a low-rate speech coding algorithm based on the sinusoidal model, this paper investigates how speech quality depends on the number of sinusoids. Speech signals are reconstructed with the number of spectral peaks varied from 5 to 40, and their quality is evaluated using a spectral envelope distortion measure and MOS (Mean Opinion Score). Two approaches are used to obtain the spectral peaks: a conventional STFT (Short-Time Fourier Transform) and a multiresolution analysis method.
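The synthesis side of the sinusoidal model, restricted to a given number of peaks, can be sketched as follows. The sketch assumes the peaks (amplitude, frequency, phase) have already been measured, for example by STFT peak picking; keeping the `k` largest-amplitude peaks is one plausible selection rule rather than the paper's stated rule, and the frame-to-frame interpolation that a full STC synthesizer needs is omitted.

```python
import math
from typing import List, Tuple

Peak = Tuple[float, float, float]  # (amplitude, frequency in Hz, phase in rad)


def keep_largest(peaks: List[Peak], k: int) -> List[Peak]:
    """Assumed selection rule: keep the k peaks with the largest amplitudes."""
    return sorted(peaks, key=lambda p: p[0], reverse=True)[:k]


def synthesize(peaks: List[Peak], n_samples: int, fs: float) -> List[float]:
    """One frame of s[n] = sum_k A_k * cos(2*pi*f_k*n/fs + phi_k)."""
    return [
        sum(a * math.cos(2 * math.pi * f * n / fs + phi) for a, f, phi in peaks)
        for n in range(n_samples)
    ]
```

Sweeping `k` from 5 to 40 with `keep_largest` before `synthesize` mirrors the experiment's variable; the quality measures themselves (spectral envelope distortion, MOS) are not shown.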

Collection of Korean Audio-video Speech Data

  • Jo, Cheol-Woo;Goecke, Roland;Millar, Bruce
    • Speech Sciences / Vol. 7, No. 1 / pp. 5-15 / 2000
  • In this paper, a detailed description of collecting Korean audio-video speech data is presented. The main aim of this experiment is to collect audio-video material that can be used in later experiments to estimate and model the movements of the visible human articulatory organs, such as the mouth, lips, and jaw. Audio-video data are collected separately from seven directions, and twelve markers are used to trace the movements.

A study on the recognition performance of connected digit telephone speech for MFCC feature parameters obtained from the filter bank adapted to training speech database

  • 정성윤;김민성;손종목;배건성;강점자
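    • The Korean Society of Phonetic Sciences and Speech Technology: Conference Proceedings / Proceedings of the May 2003 Conference of the Korean Society of Phonetic Sciences and Speech Technology / pp. 119-122 / 2003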
  • In general, triangular filters are used in the filter bank when MFCCs are computed from the spectrum of a speech signal. In [1], a new feature extraction approach was proposed that uses filter shapes obtained from the spectrum of the training speech data; principal component analysis is applied to the spectrum of the training data to obtain the filter coefficients. In this paper, we carry out speech recognition experiments with the approach of [1] on a large amount of telephone speech, namely the Korean connected-digit telephone speech database released by SITEC. The experimental results are discussed along with our findings.
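Whatever filter shapes are used, MFCC-style features are computed the same way afterward: filter-bank energies, then a logarithm, then a DCT. The sketch below takes the filter bank as an input so that either triangular mel filters or data-derived shapes can be plugged in; the PCA step from [1] that derives those shapes is not shown, and the unnormalized DCT-II and the `1e-10` energy floor are assumptions.

```python
import math
from typing import List, Sequence


def cepstra(power_spectrum: Sequence[float],
            filters: Sequence[Sequence[float]],
            n_ceps: int) -> List[float]:
    """Filter-bank energies -> log -> DCT-II, as in MFCC extraction.

    `filters` may hold triangular mel filters or data-derived shapes; each
    filter is a weight vector with one entry per spectral bin.
    """
    energies = [sum(w * p for w, p in zip(f, power_spectrum)) for f in filters]
    log_e = [math.log(max(e, 1e-10)) for e in energies]  # assumed floor avoids log(0)
    m = len(log_e)
    # Unnormalized DCT-II of the log energies (assumption: no orthonormal scaling)
    return [
        sum(log_e[j] * math.cos(math.pi * c * (j + 0.5) / m) for j in range(m))
        for c in range(n_ceps)
    ]
```

Only the `filters` argument changes between the conventional triangular filter bank and the adapted one, which isolates the effect of filter shape on recognition performance.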

Bit-selective Forward Error Correction for 14Kbps SBC-APCM (AQB) over Digital Mobile Communication Channels

  • 김민구;이재홍
    • Journal of the Korean Institute of Telematics and Electronics / Vol. 27, No. 6 / pp. 821-828 / 1990
  • A forward error correction (FEC) technique is presented for speech data in 16 Kbps digital mobile communications. 14Kbps SBC-APCM (AQB) and QPSK are used as the speech coding and modulation techniques, respectively. Because the bits in a speech data block differ in importance, applying FEC to speech data bit-selectively is more effective than applying FEC to all speech data equally. To select the bits in a speech data block to be protected by FEC, the bit error sensitivity of each bit is computed. The performance of the coding technique is then computed for several BCH and Reed-Solomon codes used as bit-selective FEC.
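Bit error sensitivity can be estimated by flipping one bit position at a time and measuring how much the decoded output changes. The sketch assumes a hypothetical `decode` function that turns a bit block into output samples and uses mean squared error as the distortion measure, since the abstract does not state which measure the authors used; the most sensitive positions are then the ones chosen for FEC protection.

```python
from typing import Callable, List, Sequence

Bits = List[int]


def bit_sensitivity(blocks: Sequence[Bits],
                    decode: Callable[[Bits], Sequence[float]]) -> List[float]:
    """Average distortion caused by flipping each bit position on its own.

    Assumption: `decode` is a hypothetical decoder; MSE stands in for the
    paper's (unstated) distortion measure.
    """
    n_bits = len(blocks[0])
    sens = [0.0] * n_bits
    for block in blocks:
        clean = decode(block)
        for i in range(n_bits):
            corrupted = list(block)
            corrupted[i] ^= 1  # single-bit channel error at position i
            noisy = decode(corrupted)
            sens[i] += sum((a - b) ** 2 for a, b in zip(clean, noisy)) / len(clean)
    return [s / len(blocks) for s in sens]


def protected_positions(sens: Sequence[float], n_protect: int) -> List[int]:
    """Indices of the n_protect most error-sensitive bits, which receive FEC."""
    order = sorted(range(len(sens)), key=lambda i: sens[i], reverse=True)
    return sorted(order[:n_protect])
```

With a toy decoder that reads a block as an unsigned integer, the most significant bit comes out as the most sensitive, which is the expected ordering for magnitude-coded parameters.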

Analysis of AMR Compressed Bit Stream for Insertion of Voice Data in QR Code

  • 오은주;조현지;정현아;배정은;유훈
    • Korea Institute of Information and Communication Engineering: Conference Proceedings / KIICE 2018 Fall Conference / pp. 490-492 / 2018
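  • This paper presents an analysis of AMR speech data, the format most widely used in everyday life, as groundwork for techniques that insert speech data into QR codes and transmit it. AMR consists of a HEADER and Speech Data, is transmitted as a bit stream, and has eight bit-rate modes in total. The HEADER carries the mode of the Speech Data, and the length of the Speech Data depends on the mode. The bit-rate mode best suited for insertion into a QR code is selected, and an analysis of that mode is provided. The ultimate goal, through analysis of and experiments on each mode, is to achieve higher compression of speech data in future work, which would improve performance by allowing speech data to be transmitted more efficiently.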
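The HEADER and mode-dependent Speech Data length described above can be illustrated with the single-channel AMR storage format of RFC 4867, assuming that is the framing the authors analyzed (the abstract does not say). In that format the file starts with the magic string `#!AMR\n`, and each frame begins with one header byte whose 4-bit frame type (FT) selects the mode and therefore the length of the speech data that follows. The frame sizes below are derived from the per-mode speech bit counts (95 to 244 bits for the eight modes) padded to whole bytes, plus the header byte; the function name is illustrative.

```python
from collections import Counter

# Assumption: single-channel AMR storage format (RFC 4867, section 5).
AMR_MAGIC = b"#!AMR\n"

# Total bytes per frame (1 header byte + speech bits padded to whole bytes).
# FT 0-7: 4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.2, 12.2 kbit/s
# (95, 103, 118, 134, 148, 159, 204, 244 speech bits); FT 8: SID; FT 15: NO_DATA.
FRAME_BYTES = {0: 13, 1: 14, 2: 16, 3: 18, 4: 20, 5: 21, 6: 27, 7: 32, 8: 6, 15: 1}


def count_frame_types(data: bytes) -> Counter:
    """Count how many frames of each frame type (FT) an AMR file contains."""
    if not data.startswith(AMR_MAGIC):
        raise ValueError("missing #!AMR magic string")
    counts: Counter = Counter()
    pos = len(AMR_MAGIC)
    while pos < len(data):
        header = data[pos]
        ft = (header >> 3) & 0x0F  # header bits P|FT(4)|Q|P|P: FT sits above the Q bit
        if ft not in FRAME_BYTES:
            raise ValueError(f"unsupported frame type {ft} at byte {pos}")
        counts[ft] += 1
        pos += FRAME_BYTES[ft]  # skip the header byte and the speech data
    return counts
```

Since the frame size is fixed per mode, the payload a QR code must hold grows linearly with duration at a mode-specific rate, which is why the choice of mode matters for this application.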

A Validity Study on Measurement of Mental Fatigue Using Speech Technology

  • 송승규;김종열;장준수;권철홍
    • Phonetics and Speech Sciences / Vol. 5, No. 1 / pp. 3-10 / 2013
  • This study proposes a method to measure mental fatigue using speech technology, an approach not used in previous research and simpler than existing complex and difficult methods. It aims to establish a relationship between the human voice and mental fatigue through experiments that measure the influence of mental fatigue on the voice. Two monotonous tasks of simple calculation, such as finding the sum of three one-digit numbers, were used to measure the feeling of monotony, and two sets of subjective questionnaires were used to measure mental fatigue. While thirty subjects performed the experiment, their questionnaire responses and speech data were collected. Speech features related to the speech source and the vocal tract filter were extracted from the speech data. According to the results, the speech parameters most closely related to mental fatigue are the mean and standard deviation of the fundamental frequency, jitter, and shimmer. This study shows that speech technology is a useful method for measuring mental fatigue.
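Jitter and shimmer are commonly defined as the average cycle-to-cycle variation of the glottal period and of the peak amplitude, relative to their means. The sketch below uses these "local" definitions (as in tools such as Praat) and assumes the periods and per-cycle amplitudes have already been extracted by a pitch marker; the abstract does not state which jitter and shimmer variants the authors computed.

```python
import statistics
from typing import Sequence, Tuple

# Assumption: "local" jitter/shimmer definitions; periods and peak amplitudes
# per glottal cycle are taken as given (pitch marking not shown).


def f0_stats(periods_s: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard deviation of F0 (Hz) from glottal cycle periods (s)."""
    f0 = [1.0 / t for t in periods_s]
    return statistics.mean(f0), statistics.stdev(f0)


def local_jitter(periods_s: Sequence[float]) -> float:
    """Mean absolute difference of consecutive periods divided by the mean period."""
    diffs = [abs(a - b) for a, b in zip(periods_s, periods_s[1:])]
    return statistics.mean(diffs) / statistics.mean(periods_s)


def local_shimmer(peak_amplitudes: Sequence[float]) -> float:
    """The same ratio computed on the peak amplitude of each cycle."""
    diffs = [abs(a - b) for a, b in zip(peak_amplitudes, peak_amplitudes[1:])]
    return statistics.mean(diffs) / statistics.mean(peak_amplitudes)
```

Both measures are dimensionless ratios, so they can be compared across speakers with different average pitch and loudness.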