Search | Korea Science

Deep Learning based Singing Voice Synthesis Modeling (딥러닝 기반 가창 음성합성(Singing Voice Synthesis) 모델링)

Kim, Minae;Kim, Somin;Park, Jihyun;Heo, Gabin;Choi, Yunjeong
- Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2022.10a / pp.127-130 / 2022
This paper studies singing voice synthesis modeling with a generator loss function. It analyzes the factors that arise when BEGAN, a deep learning algorithm optimized for image generation, is applied to the audio domain, and it reports experiments aimed at the best output quality. We focus on the problem that the L1 loss proposed in BEGAN-based models weakens the role of the hyperparameter gamma (𝛾), which was defined to control the diversity and quality of generated audio samples. Experiments show that the proposed method, together with tuning to find optimal values, can improve the quality of the synthesized singing voice.
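
For readers unfamiliar with the role of 𝛾, the following is a minimal sketch of one balancing step in the standard BEGAN formulation, not the modified loss studied in this paper (the abstract does not give its form). The function name `began_step` and the default values are assumptions chosen for illustration; the inputs are scalar autoencoder reconstruction losses for a real and a generated sample.

```python
def began_step(loss_real, loss_fake, k, gamma=0.5, lambda_k=0.001):
    """One balancing step of standard BEGAN on scalar reconstruction losses.

    loss_real: autoencoder loss L(x) on a real sample
    loss_fake: autoencoder loss L(G(z)) on a generated sample
    k:         current control variable k_t, kept in [0, 1]
    gamma:     diversity ratio, the target for L(G(z)) / L(x)
    lambda_k:  proportional gain for k (hypothetical default)
    """
    d_loss = loss_real - k * loss_fake   # discriminator objective
    g_loss = loss_fake                   # generator objective
    # Proportional control: move k so that L(G(z)) approaches gamma * L(x).
    k_next = k + lambda_k * (gamma * loss_real - loss_fake)
    k_next = min(max(k_next, 0.0), 1.0)
    # Convergence measure used to monitor training.
    m_global = loss_real + abs(gamma * loss_real - loss_fake)
    return d_loss, g_loss, k_next, m_global
```

A lower 𝛾 pushes the generated-sample loss further below the real-sample loss, which in the original image setting trades diversity for quality. An extra loss term added to the generator changes that balance, which is the effect the abstract describes.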

Spectral Shape Invariant Real-time Voice Change System (스펙트럼 형태 불변 실시간 음성 변환 시스템)

Kim Weon-Goo
- Journal of the Korean Institute of Intelligent Systems / v.15 no.1 / pp.48-52 / 2005
This paper proposes a spectral shape invariant real-time voice change method that converts a speaker's voice into a mechanical voice. LPC analysis and synthesis are used to preserve the spectrum of the voice while allowing the pitch of the synthesized speech to be changed freely. A gain matching method is applied to the excitation signal generator so that the changed voice sounds natural. Voice change experiments were conducted to evaluate the method. The results show that the original speech signal is changed into a mechanical voice signal in which the context of the speaker's voice is still conveyed correctly despite a drastic change in pitch. The system was implemented on a TI TMS320C6711 DSK board to verify that it runs in real time.
https://doi.org/10.5391/JKIIS.2005.15.1.048

Voice Frequency Synthesis using VAW-GAN based Amplitude Scaling for Emotion Transformation

Kwon, Hye-Jeong;Kim, Min-Jeong;Baek, Ji-Won;Chung, Kyungyong
- KSII Transactions on Internet and Information Systems (TIIS) / v.16 no.2 / pp.713-725 / 2022
Artificial intelligence mostly shows no definite change in emotion, which makes it hard for it to demonstrate empathy when communicating with humans. If frequency modification is applied to neutral emotions, or if a different emotional frequency is added to them, it becomes possible to develop artificial intelligence with emotions. This study proposes emotion conversion using voice frequency synthesis based on a Generative Adversarial Network (GAN). The proposed method extracts frequencies from the speech data of twenty-four actors and actresses; that is, it extracts the voice features of their different emotions, preserves linguistic features, and converts only the emotions. It then generates a frequency with a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN) in order to model prosody and preserve linguistic information, which makes it possible to learn speech features in parallel. Finally, it corrects the frequency with Amplitude Scaling. Using spectral conversion on a logarithmic scale, the result is converted into a frequency that takes human hearing characteristics into account. The proposed technique thus provides emotion conversion of speech so that artificially generated voices can express emotions.
https://doi.org/10.3837/tiis.2022.02.018

Voice Synthesis Detection Using Language Model-Based Speech Feature Extraction (언어 모델 기반 음성 특징 추출을 활용한 생성 음성 탐지)

Seung-min Kim;So-hee Park;Dae-seon Choi
- Journal of the Korea Institute of Information Security & Cryptology / v.34 no.3 / pp.439-449 / 2024
Recent rapid advances in voice generation technology have made it possible to synthesize natural voices from text alone. However, this progress has led to an increase in malicious activities such as voice phishing (vishing), in which generated voices are exploited for criminal purposes. Numerous models have been developed to detect synthesized voices, typically by extracting features from the voice and using them to estimate the likelihood that the voice was generated. This paper proposes a new model for extracting voice features to address misuse of generated voices. It uses a deep learning-based audio codec model and the pre-trained natural language processing model BERT to extract novel voice features. To assess the suitability of the proposed feature extraction model for voice detection, four generated-voice detection models were built on the extracted features and their performance was evaluated. For comparison, three voice detection models based on Deepfeature, proposed in previous studies, were evaluated against the other models in terms of accuracy and EER. The model proposed in this paper achieved an accuracy of 88.08% and a low EER of 11.79%, outperforming the existing models. These results confirm that the proposed voice feature extraction method can be an effective tool for distinguishing generated voices from real ones.
https://doi.org/10.13089/JKIISC.2024.34.3.439
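
The EER (equal error rate) reported above is the operating point at which the false acceptance rate and the false rejection rate are equal. The sketch below shows one simple way to approximate it from detector scores. It assumes that a higher score means "more likely a real voice", and the function name `equal_error_rate` is hypothetical; the paper's own evaluation code is not described in the abstract.

```python
def equal_error_rate(real_scores, fake_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumes a higher score means the detector believes the voice is real.
    """
    thresholds = sorted(set(real_scores) | set(fake_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a real voice scored below the threshold.
        frr = sum(s < t for s in real_scores) / len(real_scores)
        # False acceptance: a generated voice scored at or above the threshold.
        far = sum(s >= t for s in fake_scores) / len(fake_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of the two rates at the closest crossing.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With a finite set of scores the two error rates rarely cross exactly, so the sketch reports the midpoint at the threshold where they are closest; production toolkits usually interpolate instead.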

An Interactive Voice Web Browser Usable as a Multimodal Interface in Information Devices by Using VoiceXML

Jang, Min-Seok
- Journal of the Korean Institute of Intelligent Systems / v.14 no.6 / pp.771-775 / 2004
Today's Web environment consists mostly of HTML (Hypertext Markup Language), so users obtain Web information mainly in a GUI (Graphical User Interface) environment, clicking a mouse to follow hyperlinked information. This is far less convenient than an environment in which information can be obtained simply by using the human voice. To address this inconvenience, VoiceXML, which is derived from XML, can supply information over the telephone on the basis of today's mature voice recognition and synthesis technology. This paper presents research results on a VoiceXML VUI (Voice User Interface) Browser designed and implemented to realize this technology, and on a VoiceXML Dialog designed for efficient use of the browser.
https://doi.org/10.5391/JKIIS.2004.14.6.771

Voice Source Modeling Using Harmonic Compensated LF Model (LF 모델에 고조파 성분을 보상한 음원 모델링)

이건웅;김태우;홍재근
- Proceedings of the IEEK Conference / 1998.10a / pp.1247-1250 / 1998
In speech synthesis, the LF model is widely used as the excitation signal in voice source coding systems. However, the LF model does not represent the harmonic frequencies of the excitation signal. We propose an effective method that uses sinusoidal functions to represent the harmonics of the voice source signal. The proposed method achieves a more exact voice source waveform and better synthesized speech quality than the LF model.

A Study on a Searching, Extraction and Approximation-Synthesis of Transition Segment in Continuous Speech (연속음성에서 천이구간의 탐색, 추출, 근사합성에 관한 연구)

Lee, Si-U
- The Transactions of the Korea Information Processing Society / v.7 no.4 / pp.1299-1304 / 2000
In a speech coding system that uses voiced and unvoiced excitation sources, speech quality is distorted when voiced and unvoiced consonants coexist in a frame. I therefore propose a method for searching, extracting, and approximately synthesizing the TSIUVC (Transition Segment Including UnVoiced Consonant) so that voiced and unvoiced consonants do not coexist in a frame. The method is based on the zero-crossing rate and a pitch detector using an FIR-STREAK digital filter. The extraction rates of TSIUVC are 84.8% (plosive), 94.9% (fricative), and 92.3% (affricate) for female voices, and 88% (plosive), 94.9% (fricative), and 92.3% (affricate) for male voices. I also obtain high-quality approximation-synthesis waveforms within the TSIUVC by using frequency information below 0.547 kHz and above 2.813 kHz. This method can be applied to low-bit-rate speech coding, speech analysis, and speech synthesis.
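
The zero-crossing rate mentioned above can be illustrated with a short sketch. This covers only the general idea of the measure; the FIR-STREAK pitch detector and the paper's actual decision rule are not reproduced here, and the function names and the threshold value are hypothetical.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.

    A sample equal to zero is treated as non-negative.
    """
    if len(frame) < 2:
        return 0.0
    crossings = sum((a >= 0) != (b >= 0) for a, b in zip(frame, frame[1:]))
    return crossings / (len(frame) - 1)


def looks_unvoiced(frame, threshold=0.3):
    # Unvoiced consonants are noise-like and tend to cross zero often.
    # The threshold is an illustrative assumption, not a value from the paper.
    return zero_crossing_rate(frame) > threshold
```

Frames with a high zero-crossing rate are candidates for unvoiced consonants, while voiced speech, dominated by low-frequency periodic energy, crosses zero far less often.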

Analysis of Voice Color Similarity for the development of HMM Based Emotional Text to Speech Synthesis (HMM 기반 감정 음성 합성기 개발을 위한 감정 음성 데이터의 음색 유사도 분석)

Min, So-Yeon;Na, Deok-Su
- Journal of the Korea Academia-Industrial cooperation Society / v.15 no.9 / pp.5763-5768 / 2014
When a normal voice and various emotional voices are combined in a single synthesizer, it is important to maintain the voice color. When a synthesizer is developed using recordings with excessively expressed emotions, the voice color cannot be maintained and each synthetic speech can sound like the voice of a different speaker. In this paper, speech data were recorded and the change in voice color was analyzed in order to develop an emotional HMM-based speech synthesizer. Realizing a speech synthesizer requires recording a voice and building a database, and the recording process is particularly important for an emotional speech synthesizer. Monitoring is needed because it is quite difficult to define an emotion and maintain it at a particular level. The realized synthesizer uses a normal voice and three emotional voices (Happiness, Sadness, Anger), each with two levels, High and Low. To analyze the voice color of the normal and emotional voices, the average spectrum, measured as the accumulated spectrum of vowels, was used, and the F1 (first formant) calculated from the average spectrum was compared. The voice similarity of the Low-level emotional data was higher than that of the High-level emotional data, and recordings can be monitored with the proposed method through the change in voice similarity.
https://doi.org/10.5762/KAIS.2014.15.9.5763

Design and Implementation of VoiceXML VUI Browser (VoiceXML VUI Browser 설계/구현)

장민석;예상후
- Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2002.11a / pp.788-791 / 2002
Today's Web environment consists of HTML (Hypertext Markup Language), so users obtain Web information mainly in a GUI (Graphical User Interface) environment, clicking a mouse to follow hyperlinked information. This is far less convenient than an environment in which information can be obtained simply by using the human voice. To address this inconvenience, VoiceXML, which is derived from XML, can supply information over the telephone on the basis of today's mature voice recognition and synthesis technology. This paper presents research results on a VoiceXML Web Browser designed and implemented to realize this technology.

Design and Implementation of the low power and high quality audio encoder/decoder for voice synthesis (음성 합성용 저전력 고음질 부호기/복호기 설계 및 구현)

Park, Nho-Kyung;Park, Sang-Bong;Heo, Jeong-Hwa
- The Journal of the Institute of Internet, Broadcasting and Communication / v.13 no.6 / pp.55-61 / 2013
This paper describes the design and implementation of an audio encoder/decoder for voice synthesis. It encodes the difference between successive samples instead of the original sample values and has a compression ratio of 4. The function is verified using an FPGA, and the performance is measured on a chip fabricated in a $0.35\,\mu\mathrm{m}$ standard CMOS process. The system clock is 16.384 MHz. The measured THD+N ranges from -40 dB to -80 dB depending on frequency, and the power consumption is about 80 mW. The design is suited to mobile applications that require high audio quality and low power consumption.
https://doi.org/10.7236/JIIBC.2013.13.6.55
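
The idea of encoding the difference between successive samples can be sketched as plain delta coding. This is a lossless illustration only. The abstract does not describe the quantizer that yields the compression ratio of 4, so no quantization step is shown, and the function names are hypothetical.

```python
def delta_encode(samples):
    """Replace each sample with its difference from the previous sample."""
    prev, deltas = 0, []
    for s in samples:
        deltas.append(s - prev)
        prev = s
    return deltas


def delta_decode(deltas):
    """Rebuild the samples by accumulating the differences."""
    acc, samples = 0, []
    for d in deltas:
        acc += d
        samples.append(acc)
    return samples
```

Because neighbouring audio samples are usually close in value, the differences span a smaller range than the samples themselves and can be stored in fewer bits, which is where the saving in a scheme of this kind comes from.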

