• Title/Summary/Keyword: Text data

Search Results: 2,953

A Korean menu-ordering sentence text-to-speech system using conformer-based FastSpeech2 (콘포머 기반 FastSpeech2를 이용한 한국어 음식 주문 문장 음성합성기)

  • Choi, Yerin;Jang, JaeHoo;Koo, Myoung-Wan
    • The Journal of the Acoustical Society of Korea / v.41 no.3 / pp.359-366 / 2022
  • In this paper, we present a Korean menu-ordering sentence Text-to-Speech (TTS) system using Conformer-based FastSpeech2. The Conformer is a convolution-augmented Transformer originally proposed for speech recognition; by combining the two structures, it captures both local and global features well. It consists of two half feed-forward modules at the front and the end, sandwiching a multi-head self-attention module and a convolution module. We introduce the Conformer into Korean TTS because it is known to work well for Korean speech recognition. To compare a Transformer-based TTS model with a Conformer-based one, we train both FastSpeech2 and Conformer-based FastSpeech2. We collected a phoneme-balanced data set and used it to train our models. This corpus contains not only general conversation but also menu-ordering conversation consisting mainly of loanwords, addressing the degradation of current Korean TTS models on loanwords. When synthesized speech was generated with Parallel WaveGAN, the Conformer-based FastSpeech2 achieved a superior MOS of 4.04. We confirm that performance improves when the same architecture is changed from Transformer to Conformer in Korean TTS.
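
    The "macaron" layout described in the abstract (two half-step feed-forward modules sandwiching self-attention and a convolution module) can be sketched roughly as below. This is an illustrative PyTorch outline under assumed dimensions and hyperparameters, not the authors' implementation.

    ```python
    # Sketch of a Conformer block: half FFN -> self-attention -> conv module -> half FFN.
    # Model width, number of heads, and kernel size are assumptions for illustration.
    import torch
    import torch.nn as nn


    class FeedForward(nn.Module):
        def __init__(self, d_model, expansion=4, dropout=0.1):
            super().__init__()
            self.net = nn.Sequential(
                nn.LayerNorm(d_model),
                nn.Linear(d_model, d_model * expansion),
                nn.SiLU(),
                nn.Dropout(dropout),
                nn.Linear(d_model * expansion, d_model),
                nn.Dropout(dropout),
            )

        def forward(self, x):
            return self.net(x)


    class ConvModule(nn.Module):
        def __init__(self, d_model, kernel_size=31, dropout=0.1):
            super().__init__()
            self.norm = nn.LayerNorm(d_model)
            self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, 1)
            self.glu = nn.GLU(dim=1)
            self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                       padding=kernel_size // 2, groups=d_model)
            self.bn = nn.BatchNorm1d(d_model)
            self.act = nn.SiLU()
            self.pointwise2 = nn.Conv1d(d_model, d_model, 1)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x):                      # x: (batch, time, d_model)
            y = self.norm(x).transpose(1, 2)       # -> (batch, d_model, time)
            y = self.glu(self.pointwise1(y))
            y = self.act(self.bn(self.depthwise(y)))
            y = self.dropout(self.pointwise2(y)).transpose(1, 2)
            return y


    class ConformerBlock(nn.Module):
        def __init__(self, d_model=256, n_heads=4):
            super().__init__()
            self.ff1 = FeedForward(d_model)
            self.attn_norm = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.conv = ConvModule(d_model)
            self.ff2 = FeedForward(d_model)
            self.final_norm = nn.LayerNorm(d_model)

        def forward(self, x):
            x = x + 0.5 * self.ff1(x)              # first half feed-forward
            q = self.attn_norm(x)
            a, _ = self.attn(q, q, q)
            x = x + a                              # multi-head self-attention
            x = x + self.conv(x)                   # convolution module
            x = x + 0.5 * self.ff2(x)              # second half feed-forward
            return self.final_norm(x)


    x = torch.randn(2, 100, 256)                   # (batch, frames, features)
    print(ConformerBlock()(x).shape)               # torch.Size([2, 100, 256])
    ```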

Research on the Movie Reviews Regarded as Unsuccessful in Box Office Outcomes in Korea: Based on Big Data Posted on Naver Movie Portal

  • Jeon, Ho-Seong
    • Asia-Pacific Journal of Business / v.12 no.3 / pp.51-69 / 2021
  • Purpose - Based on prior studies of movie reviews and movie ratings, this study raised two research questions about the content of online word of mouth and the number of movie screens as mediator variables. Research question 1 asked which topic word groups had a positive or negative impact on movie ratings. Research question 2 examined the role of the number of movie screens between movie ratings and box-office outcomes. Design/methodology/approach - Using the R program, this study collected about 82,000 movie reviews and ratings posted on Naver's movie website to examine the role of online word of mouth and screen counts for 10 movies considered commercially unsuccessful, with fewer than 2 million viewers despite securing about 1,000 screens. To address research question 1, topic modeling, a text mining technique, was applied to the movie reviews. In addition, the movie ratings posted on Naver were linked with KOBIS information by date to address research question 2. Findings - Topic modeling identified 5 topics, which fell largely into two groups: the content of the movie (topics 1, 2, 3) and the evaluation of the movie (topics 4, 5). When the relationship between movie reviews and movie ratings was analyzed with the 5 topics identified as mediators for research question 1, the word groups related to topics 2, 3, and 5 had a negative effect on netizens' movie ratings. For research question 2, the two secondary data sets were connected by date; the results showed that the causal relationship between movie ratings and audience numbers was mediated by the number of movie screens. Research implications or Originality - The results suggest that information presented in text form is harder to quantify than information provided as scores, but if such content can be digitized through text mining techniques, it can become a variable and be analyzed for causality with other variables. The outcomes for research question 2 showed that movie ratings had a direct impact on the number of viewers and also indirect effects through changes in the number of movie screens. An interesting point is that the direct effect of movie ratings on the number of viewers was found in most American films released in Korea.
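
    The mediation logic of research question 2 (ratings affect audience both directly and via screen counts) can be illustrated with a minimal two-regression sketch. The column names and synthetic data below are assumptions; the study itself joined Naver ratings with KOBIS box-office data by date.

    ```python
    # Minimal mediation sketch: rating -> screens (path a), rating + screens -> audience (paths c', b).
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 200
    rating = rng.uniform(4, 9, n)
    screens = 300 + 80 * rating + rng.normal(0, 50, n)            # mediator
    audience = 1000 * rating + 40 * screens + rng.normal(0, 5000, n)
    df = pd.DataFrame({"rating": rating, "screens": screens, "audience": audience})

    m1 = smf.ols("screens ~ rating", data=df).fit()               # path a
    m2 = smf.ols("audience ~ rating + screens", data=df).fit()    # paths c' and b
    indirect = m1.params["rating"] * m2.params["screens"]          # a * b
    direct = m2.params["rating"]                                   # c'
    print(f"indirect effect via screens: {indirect:.1f}, direct effect: {direct:.1f}")
    ```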

Text Classification Using Heterogeneous Knowledge Distillation

  • Yu, Yerin;Kim, Namgyu
    • Journal of the Korea Society of Computer and Information / v.27 no.10 / pp.29-41 / 2022
  • Recently, with the development of deep learning technology, a variety of huge models with excellent performance have been devised by pre-training on massive amounts of text data. However, for such models to be applied to real-life services, inference must be fast and the amount of computation low, so model compression technology is attracting attention. Knowledge distillation, a representative model compression technique, transfers the knowledge already learned by a teacher model to a relatively small student model and can be applied in a variety of ways. However, conventional knowledge distillation has a limitation: because the teacher learns only the knowledge needed to solve a given problem and distills it to the student from the same point of view, it struggles with problems that have low similarity to the previously learned data. Therefore, we propose a heterogeneous knowledge distillation method in which the teacher model learns a higher-level concept rather than the knowledge required for the task the student model needs to solve, and then distills this knowledge to the student model. Through classification experiments on about 18,000 documents, we confirmed that heterogeneous knowledge distillation outperformed traditional knowledge distillation in both learning efficiency and accuracy.
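
    For reference, the distillation objective underlying this line of work is typically a weighted sum of cross-entropy on hard labels and a temperature-softened KL term against the teacher's outputs. The sketch below shows that standard formulation only; the paper's heterogeneous variant, where the teacher is trained on a higher-level concept, is not reproduced, and the temperature and weighting are assumed values.

    ```python
    # Standard knowledge-distillation loss: alpha * soft (KL vs. teacher) + (1 - alpha) * hard (CE vs. labels).
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        # Soft targets: KL divergence between temperature-softened distributions, scaled by T^2.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard targets: ordinary cross-entropy with the ground-truth classes.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    student = torch.randn(8, 5, requires_grad=True)   # 8 documents, 5 classes
    teacher = torch.randn(8, 5)
    labels = torch.randint(0, 5, (8,))
    print(distillation_loss(student, teacher, labels))
    ```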

Estimate Customer Churn Rate with the Review-Feedback Process: Empirical Study with Text Mining, Econometrics, and Quasi-Experiment Methodologies (리뷰-피드백 프로세스를 통한 고객 이탈률 추정: 텍스트 마이닝, 계량경제학, 준실험설계 방법론을 활용한 실증적 연구)

  • Choi Kim;Jaemin Kim;Gahyung Jeong;Jaehong Park
    • Information Systems Review / v.23 no.3 / pp.159-176 / 2021
  • Preventing user churn is a prominent strategy for capitalizing on an online game, as it avoids the initial investment required to develop another one. The extant literature has examined factors that may induce user churn, mainly from the perspectives of motives to play and the game as a virtual society, but such works largely overlook the service aspects of online games. Dissatisfaction with user needs is a crucial driver of churn, especially for online services in which users expect continuous improvement in service quality via software updates. Hence, we examine the relationship between a game's quality management and its user base. With text mining and survival analysis, we identify complaint factors that act as key predictors of user churn, and we find that enjoyment-related factors are greater threats to the user base than usability-related ones. Furthermore, a subsequent quasi-experiment shows that improvements in the complaint factors (i.e., via game patches) curb churn and foster user retention. Our results shed light on the responsive role of developers in retaining the user base of online games, and we provide practical insights for game operators, i.e., to identify and prioritize the more perilous complaint factors when planning successive game patches.
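
    The combination of complaint factors and survival analysis can be sketched as a Cox proportional-hazards model with complaint-topic intensities as covariates. The covariate names and the synthetic data below are assumptions made for illustration; the study's actual features come from text mining of player reviews (requires the `lifelines` package).

    ```python
    # Cox model sketch: complaint-factor intensities as covariates of time-to-churn.
    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    rng = np.random.default_rng(1)
    n = 500
    enjoyment_complaints = rng.random(n)      # e.g., content/balance complaints (assumed factor)
    usability_complaints = rng.random(n)      # e.g., UI/bug complaints (assumed factor)
    risk = 1.2 * enjoyment_complaints + 0.4 * usability_complaints
    duration = rng.exponential(30 / np.exp(risk))    # days until churn
    churned = (rng.random(n) < 0.8).astype(int)      # observed churn vs. censored

    df = pd.DataFrame({
        "enjoyment_complaints": enjoyment_complaints,
        "usability_complaints": usability_complaints,
        "duration": duration,
        "churned": churned,
    })
    cph = CoxPHFitter()
    cph.fit(df, duration_col="duration", event_col="churned")
    cph.print_summary()   # larger hazard ratios flag the more perilous complaint factors
    ```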

WeXGene: Web-based XML Data Generator (WeXGene: 웹 기반 XML 데이터 생성기)

  • Shin Sun Mi;Jeong Hoe Jin;Lee Sang Ho
    • The KIPS Transactions:PartD / v.12D no.2 s.98 / pp.199-210 / 2005
  • We need to generate various kinds of XML data to evaluate XML database systems. Existing XML data generators were developed to produce XML data suited to particular evaluation methods, and their functionality is limited. This paper introduces a new XML data generator, WeXGene, which not only remedies the drawbacks of existing data generators but also adds new data generation functionality. To generate XML data, WeXGene uses user data files and structure definition files, which specify an SDTD (Symbolic DTD) or input parameters. A user data file is a text data file containing column data or row data; WeXGene can also generate XML data without accessing a user data file. This paper presents the design details, overall system architecture, and data generation process of WeXGene, along with an analytic comparison with other XML data generators.
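
    The core idea of driving XML output from tabular user data can be sketched as below. The element and field names are illustrative assumptions; WeXGene itself is driven by SDTD and parameter files rather than hard-coded structure.

    ```python
    # Turn column/row user data into XML elements according to a simple assumed structure.
    import xml.etree.ElementTree as ET

    user_rows = [                                  # stands in for a user data file
        {"id": "1", "name": "alpha", "price": "1200"},
        {"id": "2", "name": "beta", "price": "800"},
    ]

    root = ET.Element("items")
    for row in user_rows:
        item = ET.SubElement(root, "item", id=row["id"])
        ET.SubElement(item, "name").text = row["name"]
        ET.SubElement(item, "price").text = row["price"]

    ET.indent(root)                                # pretty-print (Python 3.9+)
    print(ET.tostring(root, encoding="unicode"))
    ```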

A Study on Geo-Data Appliance for Using Geospatial Information of Public Open Data (개방형 공공데이터의 공간정보 활용을 위한 Geo-Data Appliance 구현방안)

  • Yeon, Sung-Hyun;Kim, Hyeon-Deok;Lee, In-Su
    • Journal of Cadastre & Land InformatiX / v.45 no.2 / pp.71-85 / 2015
  • Recently, the South Korean government has actively opened public data to encourage its use in the private sector, based on 'Government 3.0', the paradigm for government operation. In line with this trend, a data platform is required for establishing and commercializing business models that utilize geospatial data. However, it is currently difficult to build geospatial data systems from the text-based public open data. This study constructs a geospatial data supply system that uses public data so that spatially referenced public data can be provided and used efficiently. It improves user accessibility and the usability of public data that contains location information. In addition, this study proposes the components of an appliance system that connects public data held by different public institutions for different purposes and produces geospatial data as a finished product.
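
    One basic step in turning text-based open data into geospatial data is wrapping records that carry coordinates into a GeoJSON feature collection, as in the rough sketch below. The field names and sample records are assumptions; real public data sets often require address geocoding or coordinate-system conversion first.

    ```python
    # Wrap rows of open data with lon/lat fields into GeoJSON features.
    import json

    rows = [
        {"name": "Community Center", "lon": 126.9780, "lat": 37.5665},
        {"name": "Public Library", "lon": 129.0756, "lat": 35.1796},
    ]

    features = [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [r["lon"], r["lat"]]},
            "properties": {"name": r["name"]},
        }
        for r in rows
    ]
    geojson = {"type": "FeatureCollection", "features": features}
    print(json.dumps(geojson, ensure_ascii=False, indent=2))
    ```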

A Study on the Perception of Data 3 Act through Big Data Analysis (빅데이터 분석을 통한 데이터 3법 인식에 관한 연구)

  • Oh, Jungjoo;Lee, Hwansoo
    • Convergence Security Journal / v.21 no.2 / pp.19-28 / 2021
  • Korea is promoting a Digital New Deal policy for digital transformation and for accelerating innovation in industry. However, because of strict existing data-related laws, there are still restrictions on the industry's use of data under the Digital New Deal policy. To solve this issue, a revised bill of the Data 3 Act was proposed, but there has been insufficient discussion of how it will actually affect the activation of data use in industry. Therefore, this study aims to analyze public opinion on the Data 3 Act and the implications of its revision. To this end, the revision of the Data 3 Act and related research trends were reviewed, and perceptions of the Data 3 Act were analyzed using big data analysis techniques. According to the results, while the Data 3 Act promotes the vitalization of the data industry in line with the purpose of the revision, there is a concern that it focuses on specific industries. This study is meaningful in providing implications for future improvement plans by analyzing online perceptions of the industrial impact of the Data 3 Act in the early stages of its implementation through big data analysis.

Construction of Event Networks from Large News Data Using Text Mining Techniques (텍스트 마이닝 기법을 적용한 뉴스 데이터에서의 사건 네트워크 구축)

  • Lee, Minchul;Kim, Hea-Jin
    • Journal of Intelligence and Information Systems / v.24 no.1 / pp.183-203 / 2018
  • News articles are the most suitable medium for examining events occurring at home and abroad. In particular, as the development of information and communication technology has produced various kinds of online news media, the volume of news about events in society has grown greatly, so automatically summarizing key events from massive amounts of news data will help users survey many events at a glance. In addition, an event network built from the relevance between events can greatly help readers understand current events. In this study, we propose a method for extracting event networks from large news text data. To this end, we first collected Korean political and social articles from March 2016 to March 2017 and, through preprocessing with NPMI and Word2Vec, kept only meaningful words and integrated synonyms. Latent Dirichlet Allocation (LDA) topic modeling was used to calculate topic distributions by date and to detect events from the peaks of those distributions. A total of 32 topics were extracted, and the time of each event was inferred from the point at which its topic distribution surged. As a result, 85 events were detected, of which 16 were retained after filtering with a Gaussian smoothing technique. We then calculated relevance scores between the detected events to construct the event network: using the cosine coefficient between co-occurring events, we computed the relevance between events and connected them, setting each event as a vertex and the relevance score as the weight of the edge between vertices. The event network constructed with our method made it possible to sort the major political and social events in Korea over the past year in chronological order and, at the same time, to identify which events are related to which. Our approach differs from existing event detection methods in that LDA topic modeling makes it easy to analyze large amounts of data and to identify relationships between events that were difficult to detect previously. In preprocessing, we applied various text mining techniques and Word2Vec to improve the extraction of proper nouns and compound nouns, which has long been difficult in analyzing Korean text. The proposed event detection and network construction techniques have the following practical advantages. First, LDA topic modeling, which is unsupervised, can easily extract topics, topic words, and their distributions from huge amounts of data, and by using the date information of the collected articles the distribution of each topic can be expressed as a time series. Second, by calculating relevance scores from the co-occurrence of topics and constructing an event network, connections between events that are difficult to grasp with existing event detection can be presented in summarized form. Indeed, the relevance-based event network proposed in this study was constructed in order of occurrence time, and it also makes it possible to identify the event that served as the starting point of a series of events.
    The limitation of this study is that the results of LDA topic modeling differ according to the initial parameters and the number of topics, and the topic and event names in the analysis results must be assigned by the subjective judgment of the researcher. Also, since each topic is assumed to be exclusive and independent, the relevance between topics is not taken into account. Subsequent studies need to calculate the relevance between events that are not covered in this study or that belong to the same topic.
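
    The pipeline of per-article topic distributions, daily aggregation, and cosine similarity as the relevance score can be compressed into the rough sketch below. The toy corpus, topic count, and threshold are assumptions, and peak detection and Gaussian smoothing are omitted (requires scikit-learn).

    ```python
    # LDA topic distributions -> daily topic time series -> cosine similarity as edge weights.
    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "election candidate vote campaign", "election vote result turnout",
        "protest rally downtown police", "police protest arrest rally",
        "budget bill assembly vote", "assembly budget debate bill",
    ]
    dates = np.array([0, 0, 1, 1, 2, 2])             # day index of each article

    counts = CountVectorizer().fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=3, random_state=0)
    doc_topics = lda.fit_transform(counts)            # per-article topic distribution

    # Aggregate topic mass by day, then score topic-topic relevance over time.
    n_days = dates.max() + 1
    daily = np.vstack([doc_topics[dates == d].sum(axis=0) for d in range(n_days)])
    relevance = cosine_similarity(daily.T)             # topics x topics

    edges = [(i, j, relevance[i, j])
             for i in range(relevance.shape[0])
             for j in range(i + 1, relevance.shape[1])
             if relevance[i, j] > 0.5]                 # assumed threshold
    print(edges)                                       # candidate event-network edges
    ```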

Design and Implementation of HDFS Data Encryption Scheme Using ARIA Algorithms on Hadoop (하둡 상에서 ARIA 알고리즘을 이용한 HDFS 데이터 암호화 기법의 설계 및 구현)

  • Song, Youngho;Shin, YoungSung;Chang, Jae-Woo
    • KIPS Transactions on Computer and Communication Systems / v.5 no.2 / pp.33-40 / 2016
  • Due to the growth of social network services (SNS), big data has become a reality, and Hadoop was developed as a distributed platform for analyzing it. Enterprises use Hadoop to analyze data containing users' sensitive information and utilize it for marketing, so research on data encryption has been conducted to prevent the leakage of sensitive data stored in Hadoop. However, existing work supports only the AES encryption algorithm, the international standard for data encryption, whereas the Korean government has chosen the ARIA algorithm as its standard. In this paper, we propose an HDFS data encryption scheme using the ARIA algorithm on Hadoop. First, the proposed scheme provides an HDFS block-splitting component that performs ARIA encryption and decryption in the distributed computing environment of Hadoop. Second, it provides a variable-length data processing component that performs encryption and decryption by adding dummy data when the last block does not contain a full 128 bits of data. Finally, our performance analysis shows that the proposed scheme can be used effectively for both text string processing applications and scientific data analysis applications.
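
    The variable-length handling described above amounts to padding the plaintext so the last block reaches 128 bits before block-wise encryption. The sketch below uses PKCS#7-style padding purely as an illustration of that idea; the paper's actual dummy-data scheme, the ARIA cipher itself, and the HDFS block-splitting component are not reproduced here.

    ```python
    # Pad variable-length data to 128-bit (16-byte) blocks before block-cipher encryption.
    BLOCK_SIZE = 16  # 128 bits

    def pad(data: bytes) -> bytes:
        # Append N copies of byte N so the length becomes a multiple of 16 bytes.
        n = BLOCK_SIZE - (len(data) % BLOCK_SIZE)
        return data + bytes([n]) * n

    def unpad(data: bytes) -> bytes:
        return data[:-data[-1]]

    def split_blocks(data: bytes):
        return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

    record = "sensitive user record".encode("utf-8")   # illustrative plaintext
    blocks = split_blocks(pad(record))
    print(len(record), "bytes ->", len(blocks), "blocks of", BLOCK_SIZE, "bytes")
    assert unpad(b"".join(blocks)) == record           # padding is reversible
    ```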

The bootstrap VQ model for automatic speaker recognition system (VQ 방식의 화자인식 시스템 성능 향상을 위한 부쓰트랩 방식 적용)

  • Kyung YounJeong;Lee Jin-Ick;Lee Hwang-Soo
    • Proceedings of the Acoustical Society of Korea Conference / spring / pp.39-42 / 2000
  • A bootstrap aggregating (bagging) vector quantization (VQ) classifier is proposed for speaker recognition. This method obtains multiple training data sets by resampling the original training data set and then integrates the corresponding multiple classifiers into a single classifier. Experiments on a closed-set, text-independent speaker identification task are carried out using the TIMIT database. The proposed bagging VQ classifier shows considerably improved performance over the conventional VQ classifier.
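
    The bagging-VQ idea can be sketched as follows: bootstrap-resample each speaker's training frames, fit one codebook per resample, and identify a test utterance by majority vote over the codebooks with the lowest quantization distortion. KMeans stands in for codebook training, and the feature dimension and codebook size are assumptions (requires scikit-learn).

    ```python
    # Bagging VQ sketch: bootstrap resamples -> per-speaker codebooks -> majority-vote identification.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    speakers = {s: rng.normal(loc=s, scale=1.0, size=(300, 12)) for s in range(3)}
    n_bags, codebook_size = 5, 8

    # One set of codebooks per bagging round; each maps speaker -> fitted KMeans codebook.
    bags = []
    for _ in range(n_bags):
        codebooks = {}
        for spk, frames in speakers.items():
            idx = rng.integers(0, len(frames), len(frames))       # bootstrap resample
            codebooks[spk] = KMeans(codebook_size, n_init=5, random_state=0).fit(frames[idx])
        bags.append(codebooks)

    def identify(utterance):
        votes = []
        for codebooks in bags:
            # Average quantization distortion of the utterance against each speaker's codebook.
            distortions = {spk: km.transform(utterance).min(axis=1).mean()
                           for spk, km in codebooks.items()}
            votes.append(min(distortions, key=distortions.get))
        return max(set(votes), key=votes.count)                   # majority vote

    test = rng.normal(loc=1, scale=1.0, size=(50, 12))            # frames from speaker 1
    print(identify(test))
    ```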
