• Title/Summary/Keyword: Text data


Text Extraction In WWW Images (웹 영상에 포함된 문자 영역의 추출)

  • 김상현;심재창;김중수
    • Proceedings of the IEEK Conference / 2000.06d / pp.15-18 / 2000
  • In this paper, we propose a method for extracting text from Web images. Our approach is based on contrast detection and pixel component ratio analysis at the mouse position. The data extracted with OCR can be used for real-time dictionary lookup or language translation applications in a Web browser.


Semi-supervised domain adaptation using unlabeled data for end-to-end speech recognition (라벨이 없는 데이터를 사용한 종단간 음성인식기의 준교사 방식 도메인 적응)

  • Jeong, Hyeonjae;Goo, Jahyun;Kim, Hoirin
    • Phonetics and Speech Sciences / v.12 no.2 / pp.29-37 / 2020
  • Recently, neural network-based deep learning algorithms have dramatically improved performance compared to the classical Gaussian mixture model based hidden Markov model (GMM-HMM) automatic speech recognition (ASR) system. In addition, research on end-to-end (E2E) speech recognition systems that integrate language modeling and decoding processes has been actively conducted to better utilize the advantages of deep learning techniques. In general, E2E ASR systems consist of multiple layers of encoder-decoder structure with attention. Therefore, E2E ASR systems require a large amount of speech-text paired data in order to achieve good performance. Obtaining speech-text paired data requires a lot of human labor and time, and is a high barrier to building an E2E ASR system. Previous studies have therefore sought to improve the performance of E2E ASR systems using a relatively small amount of speech-text paired data, but most of them used either speech-only data or text-only data alone. In this study, we propose a semi-supervised training method that enables an E2E ASR system to perform well on corpora from different domains by using both speech-only and text-only data. The proposed method adapts effectively to different domains, showing good performance in the target domain without degrading much in the source domain.

Korean and English Sentiment Analysis Using the Deep Learning

  • Ramadhani, Adyan Marendra;Choi, Hyung Rim;Lim, Seong Bae
    • Journal of Korea Society of Industrial Information Systems / v.23 no.3 / pp.59-71 / 2018
  • Social media is immensely popular among today's services. Data from social network services (SNSs) can be used for various objectives, such as text prediction or sentiment analysis. There is a great deal of Korean and English data on social media that can be used for sentiment analysis, but handling such huge amounts of unstructured data is a difficult task that calls for machine learning. This research focuses on predicting Korean and English sentiment using a deep feedforward neural network (DNN) with a deep learning architecture and compares it with other methods, such as LDA MLP and GENSIM, using logistic regression. The research findings indicate an approximately 75% accuracy rate when predicting sentiments using DNN, with a latent Dirichlet allocation (LDA) prediction accuracy rate of approximately 81%, with the corpus being approximately 64% accurate between English and Korean.

Analysis of English abstracts in Journal of the Korean Data & Information Science Society using topic models and social network analysis (토픽 모형 및 사회연결망 분석을 이용한 한국데이터정보과학회지 영문초록 분석)

  • Kim, Gyuha;Park, Cheolyong
    • Journal of the Korean Data and Information Science Society / v.26 no.1 / pp.151-159 / 2015
  • This article analyzes the English abstracts of articles published in the Journal of the Korean Data & Information Science Society using text mining techniques. First, term-document matrices are formed by various methods and then visualized by social network analysis. LDA (latent Dirichlet allocation) and CTM (correlated topic model) are also employed to extract topics from the abstracts. The performance of the topic models is compared via entropy for several numbers of topics and for the weighting methods used to form the term-document matrices.

Description-Based Multimedia Clipart Retrieval in WWW

  • Kim, Hion-Gun;Sin, Bong-Kee;Song, Ju-Won
    • Proceedings of the Korean Society of Broadcast Engineers Conference / 1998.06b / pp.111-115 / 1998
  • The Internet today teems not only with text data but also with other media such as sound and still and moving images in a variety of formats. Unlike text, which can be retrieved easily with the help of numerous search engines, there have been few ways to access data of other media unless the exact location or URL is known. Multimedia data on the WWW are contained in or linked via anchors in hyper-documents. They could most reliably be retrieved by analyzing the binary data content, but this is far from practical given the current state of the art. Instead, we present another search technique based on textual descriptions found at or around the multimedia objects. The textual descriptions used in this research include the file name (URL), the anchor text and its context, and alternative descriptions found in the ALT attribute of HTML tags. These are clues presumed to be relevant to the contents. Although there is a possibility of missing or misinterpreting images and sounds, description-based search is highly practical in terms of computation. The prototype search engine will soon be deployed as a public service through the prestigious search engine InfoDetective in Korea.

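The description-based idea in the abstract above (file name, anchor text, ALT text as search clues) can be illustrated with a minimal sketch. This is not the authors' system: the class `DescriptionCollector` and the helpers `file_name` and `search` are hypothetical names introduced here, the context around anchors is ignored, and matching is a plain substring test. Only Python's standard `html.parser` and `urllib.parse` modules are used.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
import posixpath


def file_name(url):
    """Return the last path component of a URL (hypothetical helper)."""
    return posixpath.basename(urlparse(url).path)


class DescriptionCollector(HTMLParser):
    """Collect textual clues (file name, ALT text, anchor text) per media link."""

    def __init__(self):
        super().__init__()
        self.records = []     # one dict per image or linked object
        self._anchor = None   # record for the <a> element currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.records.append({
                "url": attrs["src"],
                "file_name": file_name(attrs["src"]),
                "alt": attrs.get("alt") or "",
                "anchor_text": "",
            })
        elif tag == "a" and attrs.get("href"):
            self._anchor = {
                "url": attrs["href"],
                "file_name": file_name(attrs["href"]),
                "alt": "",
                "anchor_text": "",
            }

    def handle_data(self, data):
        # Text between <a> and </a> is the anchor text of the linked object.
        if self._anchor is not None:
            self._anchor["anchor_text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor is not None:
            self._anchor["anchor_text"] = self._anchor["anchor_text"].strip()
            self.records.append(self._anchor)
            self._anchor = None


def search(records, query):
    """Return records whose combined description contains the query string."""
    q = query.lower()
    return [r for r in records
            if q in " ".join((r["file_name"], r["alt"], r["anchor_text"])).lower()]
```

After feeding a page to a `DescriptionCollector` instance with `feed(html)`, calling `search(collector.records, "sunset")` would return the objects whose file name, ALT text, or anchor text mentions the query, without ever decoding the binary media content.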

Perceptions and Trends of Digital Fashion Technology - A Big Data Analysis - (빅데이터 분석을 이용한 디지털 패션 테크에 대한 인식 연구)

  • Song, Eun-young;Lim, Ho-sun
    • Fashion & Textile Research Journal / v.23 no.3 / pp.380-389 / 2021
  • This study aimed to reveal the perceptions and trends of digital fashion technology through an informational approach. A big data analysis was conducted after collecting text published in a web environment from April 2019 to April 2021. Keywords were derived through text mining and network analysis, and the structure of perception of digital fashion technology was identified. Using Textom, we collected 8,144 texts after data refinement, conducted frequency-of-occurrence and central component analyses, and visualized the results with word clouds and N-grams. Matrices were also generated from the 70 most frequent words, and a structural equivalence analysis was performed. The results were presented with network visualizations and dendrograms. Fashion, digital, and technology were the most frequently mentioned topics, and the frequencies of platform, digital transformation, and startups were also high. Through clustering, four clusters of marketing were formed using fashion, digital technology, startups, and augmented reality/virtual reality technology. Future research on startups and smart factories with technologies based on stable platforms is needed. The results of this study contribute to increasing the fashion industry's knowledge of digital fashion technology and can be used as a foundational study for the development of research on related topics.

Analysis of Real Estate Market Trend Using Text Mining and Big Data (빅데이터와 텍스트마이닝을 이용한 부동산시장 동향분석)

  • Chun, Hae-Jung
    • Journal of Digital Convergence / v.17 no.4 / pp.49-55 / 2019
  • This study examines real estate market trends using text mining and big data. The data were collected from internet news posted on Naver from August 2016 to August 2017. The TF-IDF analysis ranked housing, sale, household, real estate market, and region highest, in that order. Many policy-related words such as loan, government, countermeasures, and regulations were extracted, and among the region-related words, Seoul appeared most frequently. Among combinations of region-related words, 'Seoul - Gangnam', 'Seoul - Metropolitan area', 'Gangnam - reconstruction', and 'Seoul - reconstruction' appeared frequently. This suggests that public interest in and expectations for the reconstruction of the Gangnam area are high.
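As a rough illustration of the TF-IDF weighting mentioned in this abstract, the sketch below uses one common variant (relative term frequency times the logarithm of the inverse document frequency). The abstract does not state which variant the study used, so the formula, the function name `tf_idf`, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    weight = (count / document length) * ln(number of documents / document frequency)
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy tokenized "news articles", invented for illustration only.
docs = [
    ["housing", "sale", "seoul"],
    ["housing", "loan", "regulation"],
    ["housing", "seoul", "reconstruction"],
]
weights = tf_idf(docs)
```

In this unsmoothed variant a term that occurs in every document, such as "housing" in the toy data, receives a weight of zero, while a term unique to one document is weighted highest. Smoothed variants, or summing weights over a corpus, can rank ubiquitous terms differently.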

Text-Mining Analysis of Korea Government R&D Trends in Construction Machinery Domains (텍스트 마이닝을 통한 건설기계분야 국내 정부 R&D 연구동향 분석)

  • Bom Yun;Joonsoo Bae
    • Journal of Korean Society of Industrial and Systems Engineering / v.46 no.spc / pp.1-8 / 2023
  • To investigate the national science and technology policy direction in the field of construction machinery, an analysis was conducted on projects selected as national research and development (R&D) initiatives by the government. Assuming that the project titles contain the core keywords, text mining was employed to substantiate this assumption. Project information data spanning nine years from 2014 to 2022 was collected through the National Science & Technology Information Service (NTIS). To observe changes over time, the years were divided into three-year sections. To analyze research trends efficiently, keywords were categorized into groups: 'equipment,' 'smart,' and 'eco-friendly.' Based on the collected data, keyword frequency analysis, N-gram analysis, and topic modeling were performed. The research findings indicate that domestic government R&D in the construction machinery field primarily focuses on smart-related research and development. Specifically, investments in monitoring systems and autonomous operation technologies are increasing. This study holds significance in analyzing objective research trends through the utilization of big data analysis techniques and is expected to contribute to future research and development planning, strategy formulation, and project management.
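The keyword-frequency and N-gram steps described above, applied per three-year section, might look like the following sketch. The project titles are invented for illustration, tokenization is simple whitespace splitting, and the names `PERIODS`, `ngrams`, `period_of`, and `count_ngrams` are hypothetical. The study's actual preprocessing and its topic modeling step are not reproduced.

```python
from collections import Counter

# The three-year sections named in the abstract (2014 to 2022).
PERIODS = [(2014, 2016), (2017, 2019), (2020, 2022)]


def ngrams(tokens, n):
    """Return the list of contiguous n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def period_of(year):
    """Map a year to its three-year section, or None if it is out of range."""
    for start, end in PERIODS:
        if start <= year <= end:
            return (start, end)
    return None


def count_ngrams(projects, n):
    """Count n-grams of lowercased title words within each three-year section."""
    counts = {p: Counter() for p in PERIODS}
    for year, title in projects:
        period = period_of(year)
        if period is not None:
            counts[period].update(ngrams(title.lower().split(), n))
    return counts


# Hypothetical (year, project title) pairs, invented for illustration only.
projects = [
    (2015, "Hydraulic excavator equipment durability"),
    (2018, "Smart excavator monitoring system"),
    (2021, "Autonomous excavator monitoring system"),
    (2022, "Eco-friendly electric excavator"),
]
unigrams = count_ngrams(projects, 1)  # keyword frequencies per section
bigrams = count_ngrams(projects, 2)   # two-word phrases per section
```

Comparing `unigrams[(2014, 2016)].most_common(5)` with the same call for later sections is one simple way to observe how prominent keywords shift over time.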

Implementation of Voice Control on PDA using the Text Independent Vocabulary Recognizer (가변어휘 인식기를 이용한 PDA상에서의 음성제어 구현)

  • Kwak Sang Hun;Choi Seung Ho;Shin Do Sung;Kim Jin Young
    • MALSORI / no.43 / pp.57-72 / 2002
  • Speech recognition technology has a wide field of application. Its range is now spreading into mobile computing, where communication equipment is carried around extensively. In particular, recognition in the internet environment is rapidly moving into the mobile environment. In these environments, users want faster data transmission and lighter portable equipment for data access, namely the PDA (Personal Digital Assistant). Therefore, in this paper we designed a triphone-based text independent vocabulary recognizer for the implementation of speech control. The text independent vocabulary recognizer is based on the state joint algorithm with decision trees.


Use of the estimated critical values adapting a regression equation for the approximate entropy test

  • Cha, Kyung-Joon;Ryu, Je-Seon
    • Journal of the Korean Data and Information Science Society / v.13 no.2 / pp.77-85 / 2002
  • Statistical testing methods have been widely recognized as a way to distinguish plain text from cipher text. In fact, the randomness of a sequence produced by an encryption algorithm is necessary to guarantee the security and reliability of the cipher algorithm. Thus, statistical randomness tests are used to identify cipher text. In this paper, we provide the critical value for an approximate entropy test by estimating a nonlinear regression equation when the number of sequences and the level of significance are given. Thus, we can distinguish plain and cipher text in real problems for a given number of sequences and level of significance. We also confirm the fitness of the estimated critical values from the rate of success for plain or cipher text.

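The abstract above does not define its test statistic, so the sketch below assumes the approximate entropy test as formulated in NIST SP 800-22: overlapping m-bit and (m+1)-bit patterns are counted with wraparound, and the statistic is 2n(ln 2 - ApEn(m)). The function names are hypothetical, and the paper's regression-estimated critical values are not reproduced here.

```python
import math
from collections import Counter


def phi(bits, m):
    """Sum of p * ln(p) over the observed overlapping m-bit patterns.

    `bits` is a string of '0' and '1' characters and m must be at least 1.
    The first m - 1 bits are appended so that exactly len(bits) blocks exist.
    """
    n = len(bits)
    extended = bits + bits[:m - 1]
    counts = Counter(extended[i:i + m] for i in range(n))
    return sum((c / n) * math.log(c / n) for c in counts.values())


def approximate_entropy_statistic(bits, m):
    """Return (ApEn(m), chi-squared statistic) for a bit string."""
    n = len(bits)
    ap_en = phi(bits, m) - phi(bits, m + 1)
    chi_squared = 2.0 * n * (math.log(2) - ap_en)
    return ap_en, chi_squared


# Worked example from NIST SP 800-22: n = 10, m = 3.
ap_en, chi_squared = approximate_entropy_statistic("0100110101", 3)
# ap_en is about 0.190954 and chi_squared is about 10.043859
```

A sequence would be judged non-random when the statistic exceeds a critical value for the chosen significance level. The paper's contribution is estimating that critical value by nonlinear regression, whereas the NIST formulation derives a p-value from a chi-squared distribution.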