• Title/Summary/Keyword: Text data

A Study on Market Size Estimation Method by Product Group Using Word2Vec Algorithm (Word2Vec을 활용한 제품군별 시장규모 추정 방법에 관한 연구)

  • Jung, Ye Lim;Kim, Ji Hui;Yoo, Hyoung Sun
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.1-21 / 2020
  • With the rapid development of artificial intelligence technology, various techniques have been developed to extract meaningful information from unstructured text data, which constitutes a large portion of big data. Over the past decades, text mining technologies have been utilized in various industries for practical applications. In the field of business intelligence, text mining has been employed to discover new market and/or technology opportunities and to support rational decision making by business participants. Market information such as market size, market growth rate, and market share is essential for setting companies' business strategies. There has been continuous demand in various fields for market information at the level of specific products. However, such information has generally been provided at the industry level or in broad categories based on classification standards, making it difficult to obtain specific and appropriate information. In this regard, we propose a new methodology that can estimate the market sizes of product groups at more detailed levels than previously offered. We applied the Word2Vec algorithm, a neural network-based semantic word embedding model, to enable automatic market size estimation from individual companies' product information in a bottom-up manner. The overall process is as follows. First, data related to product information is collected, refined, and restructured into a form suitable for the Word2Vec model. Next, the preprocessed data is embedded into a vector space by Word2Vec, and product groups are derived by extracting similar product names based on cosine similarity. Finally, the sales data on the extracted products is summed to estimate the market size of each product group. As experimental data, product-name text from Statistics Korea's microdata (345,103 cases) was mapped into a multidimensional vector space by Word2Vec training. We performed parameter optimization for training and then applied a vector dimension of 300 and a window size of 15 as the optimized parameters for further experiments. We employed the index words of the Korean Standard Industry Classification (KSIC) as a product name dataset to cluster product groups more efficiently. Product names similar to the KSIC index words were extracted based on cosine similarity. The market size of the extracted products, treated as one product category, was calculated from individual companies' sales data. The market sizes of 11,654 specific product lines were automatically estimated by the proposed model. For performance verification, the results were compared with the actual market sizes of some items; the Pearson correlation coefficient was 0.513. Our approach has several advantages over previous studies. First, text mining and machine learning techniques were applied to market size estimation for the first time, overcoming the limitations of traditional methods that rely on sampling or require multiple assumptions. In addition, the level of the market category can be adjusted easily and efficiently according to the purpose of the information by changing the cosine similarity threshold. Furthermore, the approach has high potential for practical application, since it can address unmet needs for detailed market size information in the public and private sectors. Specifically, it can be utilized in technology evaluation and technology commercialization support programs conducted by governmental institutions, as well as in business strategy consulting and market analysis report publishing by private firms. The limitation of our study is that the presented model needs to be improved in terms of accuracy and reliability. The semantic word embedding module can be advanced by imposing a proper order on the preprocessed dataset or by combining Word2Vec with another measure such as Jaccard similarity. Also, the product group clustering method can be replaced with other types of unsupervised machine learning algorithms. Our group is currently working on follow-up studies, which we expect to further improve the performance of the basic model conceptually proposed in this study.
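
The grouping and summation steps described in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the three-dimensional vectors, the product names, the sales figures, and the function names are all hypothetical, whereas the study trained 300-dimensional Word2Vec embeddings on real product names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def estimate_market_size(index_vec, product_vecs, sales, threshold):
    """Group products whose vectors are close to the index word, then sum their sales."""
    group = sorted(name for name, vec in product_vecs.items()
                   if cosine(index_vec, vec) >= threshold)
    return group, sum(sales[name] for name in group)

# Hypothetical toy embeddings standing in for trained Word2Vec vectors.
product_vecs = {
    "led lamp": [0.9, 0.1, 0.0],
    "led bulb": [0.8, 0.2, 0.1],
    "steel pipe": [0.0, 0.1, 0.9],
}
# Hypothetical sales figures per product name.
sales = {"led lamp": 120, "led bulb": 80, "steel pipe": 300}
# Hypothetical vector for a classification index word such as "lighting".
index_vec = [1.0, 0.0, 0.0]

group, size = estimate_market_size(index_vec, product_vecs, sales, threshold=0.8)
print(group, size)
```

Raising the threshold narrows the product group and lowering it widens the group, which corresponds to the adjustable category level mentioned in the abstract.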

A System for Automatic Classification of Traditional Culture Texts (전통문화 콘텐츠 표준체계를 활용한 자동 텍스트 분류 시스템)

  • Hur, YunA;Lee, DongYub;Kim, Kuekyeng;Yu, Wonhee;Lim, HeuiSeok
    • Journal of the Korea Convergence Society / v.8 no.12 / pp.39-47 / 2017
  • The Internet has increased the number of digital web documents related to the history and traditions of Korean culture. However, users who search for creators or materials related to traditional culture often cannot find the information they want, and the results are insufficient. Document classification is required to access this information effectively. In the past, documents were classified manually, which takes a great deal of time and money. Therefore, this paper develops an automatic text classification model for traditional cultural contents based on data from the Korean information culture field, which is organized according to a systematic classification of traditional cultural contents. This study applied a TF-IDF model, a Bag-of-Words model, and a combined TF-IDF/Bag-of-Words model to extract word frequencies from the 'Korea Traditional Culture' data. We then developed the automatic text classification model using the Support Vector Machine classification algorithm.
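
To make the two feature models named in the abstract above concrete, here is a minimal sketch of Bag-of-Words counts and TF-IDF weights. It is an assumption-laden illustration rather than the paper's pipeline: the toy documents and function names are hypothetical, the IDF variant log(N / df) is just one common choice, and the Support Vector Machine step is left out.

```python
import math
from collections import Counter

def bag_of_words(docs):
    """Raw term counts per document (whitespace tokenization, lowercased)."""
    return [Counter(doc.lower().split()) for doc in docs]

def tf_idf(docs):
    """TF-IDF weights using tf = count / doc length and idf = log(N / df)."""
    counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({t: (n / total) * math.log(n_docs / df[t])
                        for t, n in c.items()})
    return weights

# Hypothetical toy documents standing in for traditional-culture texts.
docs = ["mask dance mask", "court music dance", "court music ritual"]
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "mask" receives a higher weight than "dance" because it is both more frequent in that document and rarer across the collection. Vectors of this kind would then be passed to a classifier.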

Automatic Training Corpus Generation Method of Named Entity Recognition Using Knowledge-Bases (개체명 인식 코퍼스 생성을 위한 지식베이스 활용 기법)

  • Park, Youngmin;Kim, Yejin;Kang, Sangwoo;Seo, Jungyun
    • Korean Journal of Cognitive Science / v.27 no.1 / pp.27-41 / 2016
  • Named entity recognition classifies elements in text into predefined categories and is used in various fields that receive natural language input. In this paper, we propose a method that automatically generates a named entity training corpus using knowledge bases. We apply two different methods to generate the corpus, depending on the knowledge base. One method attaches named entity labels to text data using Wikipedia. The other crawls data from the web and labels named entities in the web text using Freebase. We conduct two experiments to evaluate the corpus quality and our proposed method for automatically generating a named entity recognition corpus. We randomly extract sentences from the two corpora, called the Wikipedia corpus and the Web corpus, and label them to validate both automatically labeled corpora. We also show the performance of a named entity recognizer trained on the corpus generated by our proposed method. The results show that our proposed method adapts well to a new corpus that reflects diverse sentence structures and the newest entities.
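
One simple way to picture knowledge-base labeling is longest-match tagging of token spans against a dictionary of known entities. The sketch below is a generic illustration and not the procedure from the paper: the BIO tag scheme, the `label_sentence` function, and the tiny knowledge base are all assumptions made for the example.

```python
def label_sentence(tokens, knowledge_base):
    """Attach BIO tags by longest match against knowledge-base entries.

    knowledge_base maps a tuple of tokens to an entity label.
    """
    tags = ["O"] * len(tokens)
    max_len = max((len(key) for key in knowledge_base), default=0)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-token names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in knowledge_base:
                label = knowledge_base[span]
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags

# Hypothetical knowledge-base entries.
kb = {("Seoul",): "LOC", ("Seoul", "National", "University"): "ORG"}
tokens = "Seoul National University is in Seoul".split()
tags = label_sentence(tokens, kb)
print(list(zip(tokens, tags)))
```

Preferring the longest match keeps "Seoul National University" from being split into a location followed by unlabeled tokens, which is a typical source of noise in automatically labeled corpora.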

Military Security Policy Research Using Big Data and Text Mining (빅데이터와 텍스트마이닝 기법을 활용한 군사보안정책 탐구)

  • Kim, Doo Hwan;Park, Ho Jeong
    • Convergence Security Journal / v.19 no.4 / pp.23-34 / 2019
  • This study used big data, one of the new technologies of the Fourth Industrial Revolution, to examine policy directions for the military security of the Army. Analyzing military security trends in domestic and foreign papers with text mining makes it possible to set policy directions and reduce trial and error. We found differences between domestic and international studies on military security. Domestic research shows that, in the course of the Fourth Industrial Revolution, there is strong interest in technological security, such as IT in security and cyber security concerning North Korea. Foreign research, on the other hand, studies policy on the premise that military security is needed at the level of cooperation between countries and that it can contribute to world peace. Various academic policy studies have been underway that address world peace and security levels, not just security levels. This contrasts with our decades of immediate confrontation with North Korea, but it suggests complementary measures that cannot be overlooked from a broader perspective. In conclusion, academic research at home and abroad should take a macro perspective under cooperation among national networks, rather than focusing only on technological security, recognizing that military security is a policy product that should be studied within a security system between countries.

Analysis of Home Economics Curriculum Using Text Mining Techniques (텍스트 마이닝 기법을 활용한 중학교 가정과 교육과정 분석)

  • Lee, Gi-Sen;Lim, So-Jin;Choi, Yoo-ri;Kim, Eun-Jong;Lee, So-Young;Park, Mi-Jeong
    • Journal of Korean Home Economics Education Association / v.30 no.3 / pp.111-127 / 2018
  • The purpose of this study was to analyze the home economics education curriculum from the first national curriculum to the 2015 revised curriculum using text mining techniques from big data analysis. The subjects of the analysis were 10 curriculum texts, from the first national curriculum to the 2015 revised curriculum, obtained via the National Curriculum Information Center. The major findings were as follows. First, the amount of data gradually increased from the 4th curriculum to the 2015 revised curriculum. Second, extracting the core concepts of each curriculum showed that some core concept words changed while others were maintained across curricula. 'Life' and 'home' were core concepts that persisted regardless of changes in the curriculum, while 'problem', 'ability', 'solution' and 'practice' were emphasized after the 2007 revised curriculum. Third, core concept network analysis represented the relationships between core concepts as nodes and lines for each home economics curriculum. This confirmed that the core concepts emphasized in each period are strongly connected with 'life' and 'home'. Based on these results, this study is meaningful in that it provides basic data for shaping the identity and direction of home economics education.
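
A core concept network of the kind described above is often built from co-occurrence counts, where two concepts are linked whenever they appear in the same sentence. The following sketch assumes that construction; the sentences, the concept list, and the function name are hypothetical, and the abstract does not state how the authors weighted their network.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(sentences, concepts):
    """Count how often each pair of concept words shares a sentence."""
    edges = Counter()
    for sentence in sentences:
        present = sorted(set(sentence.lower().split()) & concepts)
        # Every unordered pair of concepts in the sentence is one edge.
        edges.update(combinations(present, 2))
    return edges

# Hypothetical curriculum-like sentences and core concept words.
sentences = [
    "life at home",
    "problem solving in home life",
    "practice in daily life",
]
concepts = {"life", "home", "problem", "practice"}

edges = cooccurrence_edges(sentences, concepts)
for pair, weight in edges.most_common():
    print(pair, weight)
```

Nodes with many heavy edges, here 'life', play the hub role that the abstract attributes to 'life' and 'home'.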

Research on text mining based malware analysis technology using string information (문자열 정보를 활용한 텍스트 마이닝 기반 악성코드 분석 기술 연구)

  • Ha, Ji-hee;Lee, Tae-jin
    • Journal of Internet Computing and Services / v.21 no.1 / pp.45-55 / 2020
  • Due to the development of information and communication technology, the number of new and variant malicious codes is increasing rapidly every year, and various types of malicious code are spreading with the development of the Internet of Things and cloud computing. In this paper, we propose a malware analysis method based on string information, which can be used regardless of the operating system environment and which represents library call information related to malicious behavior. Attackers can easily create malware by reusing existing code or by using automated authoring tools, and the generated malware operates in a similar way to existing malware. Since most of the strings that can be extracted from malicious code carry information closely related to malicious behavior, they are processed by weighting data features with a text mining-based method to extract effective features for malware analysis. Based on the processed data, models are built with various machine learning algorithms to run experiments on malware detection and on classification of malware groups. The data was compared and verified against all files used on Windows and Linux operating systems. The accuracy of malware detection is about 93.5%, and the accuracy of group classification is about 90%. The proposed technique has a wide range of applications because it is relatively simple, fast, and operating system independent, and because it works as a single model, with no need to build a separate model for each group when classifying malware groups. In addition, since the string information is extracted through static analysis, it can be processed faster than analysis methods that execute the code directly.
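
Static string extraction of the kind the abstract relies on can be approximated by scanning a file's bytes for runs of printable ASCII, much like the Unix `strings` utility. The sketch below is an assumption about that first step only: the function name, the minimum length of 4, and the sample bytes are illustrative, and the feature weighting and classification stages are not shown.

```python
import re

def extract_strings(data: bytes, min_len: int = 4):
    """Return runs of printable ASCII characters at least min_len long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]

# Hypothetical binary fragment containing two API names and some noise.
sample = b"\x00\x01CreateRemoteThread\x00\x02ab\x00WriteProcessMemory\xff"
print(extract_strings(sample))
```

Because the scan never executes the file, the same code applies to binaries from any operating system, which matches the platform independence claimed in the abstract.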

E-mail System Providing Integrated User's View for the Message containing Image and Text (이미지와 텍스트 메시지의 통합 사용자 뷰를 제공하는 전자 우편 시스템)

  • Dok-Go, Se-Jun;Lee, Taek-Gyun;Lee, Hyeong-U;Yun, Seong-Hyeon;Lee, Seong-Hwan;Kim, Chang-Heon;Kim, Tae-Yun
    • The Transactions of the Korea Information Processing Society / v.4 no.2 / pp.563-572 / 1997
  • E-mail has been widely used for information delivery as an Internet service. As multimedia technologies develop rapidly, most recent Internet information services support multimedia data, and E-mail systems also need to support multimedia messages. However, Internet mail service using the simple mail transfer protocol (SMTP) specified in RFC 821/822 handles only ASCII text messages represented with 7-bit code, and each line of the message has a length limitation as well. For these reasons it cannot satisfy diverse user demands. Multipurpose Internet mail extensions (MIME), a modification and supplement of RFC 822, was proposed to support the transport of multimedia data. It can overcome the limitations on the sizes and types of message contents. In this study, an E-mail system has been designed and implemented according to the MIME standard in order to remove the limitations on message transport regardless of the message content type. Hypertext markup language (HTML) syntax is applied to the mail system, so that a message consisting of different media can be displayed in an integrated form for better understanding of the message. No application program is needed to display a message that includes image data, and user convenience is considered in the system. Future work is to improve the E-mail system so that it supports motion pictures and sound information, thereby developing a multimedia E-mail system that provides an integrated user's view.
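
For readers unfamiliar with MIME, the following sketch builds a message whose HTML body references an inline image, which is the kind of integrated text-and-image view the abstract describes. It uses Python's standard `email` package rather than the 1997 system itself; the addresses, subject, content ID, and placeholder image bytes are all hypothetical.

```python
from email.message import EmailMessage

# Placeholder bytes standing in for real PNG data (hypothetical).
image_bytes = b"\x89PNG\r\n\x1a\n"

msg = EmailMessage()
msg["Subject"] = "Integrated text and image"
msg["From"] = "sender@example.com"
msg["To"] = "receiver@example.com"

# Plain-text fallback for clients that cannot render HTML.
msg.set_content("This message contains an image.")

# HTML alternative that refers to the image by its Content-ID.
msg.add_alternative(
    '<html><body><p>See the figure:</p><img src="cid:figure1"></body></html>',
    subtype="html",
)

# Attach the image as a related part of the HTML alternative.
html_part = msg.get_payload()[1]
html_part.add_related(image_bytes, maintype="image", subtype="png", cid="<figure1>")

for part in msg.walk():
    print(part.get_content_type())
```

The image travels base64-encoded inside the message, so it passes through 7-bit SMTP transport while the mail client still shows text and picture together.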

Assessment of Public Awareness on Invasive Alien Species of Freshwater Ecosystem Using Conservation Culturomics (보전문화체학 접근방식을 통한 생태계교란 생물인 담수 외래종의 대중인식 평가)

  • Park, Woong-Bae;Do, Yuno
    • Journal of Wetlands Research / v.23 no.4 / pp.364-371 / 2021
  • Public awareness of alien species can vary by generation, period, or specific events associated with these species. An understanding of public awareness is important for the management of alien species because differences in public awareness can affect the establishment and implementation of management plans. We analyzed digital texts from social media platforms, news articles, and internet search volumes, as used in conservation culturomics, to understand public interest and sentiment regarding alien freshwater species. The number of tweets, the number of news articles, and the relative search volume for 11 freshwater alien species were extracted to determine public interest. Additionally, the trend over time, seasonal variability, and repetition period of these data were examined. We also calculated sentiment scores and analyzed public sentiment in the collected data using sentiment analysis based on text mining techniques. The American bullfrog, nutria, bluegill, and largemouth bass drew relatively more public interest than other species. Some species showed repeated patterns at specific periods in the number of Twitter posts, media coverage, and internet searches. The text mining results showed that most people held negative sentiments toward alien freshwater species. In particular, negative sentiments increased over the years after alien species were designated as ecologically disturbing species.
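
A sentiment score like the one mentioned above can be computed in many ways. A minimal lexicon-based version, shown here purely as an assumption, averages the polarity of the known words in each text. The lexicon, the example posts, and the scoring rule are hypothetical and are not taken from the study.

```python
def sentiment_score(text, lexicon):
    """Average polarity of the lexicon words found in the text (0.0 if none)."""
    tokens = text.lower().split()
    hits = [lexicon[token] for token in tokens if token in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

# Hypothetical polarity lexicon: -1 for negative words, +1 for positive words.
lexicon = {"harmful": -1, "invasive": -1, "threat": -1, "beautiful": 1, "useful": 1}

# Hypothetical posts about alien freshwater species.
posts = [
    "bullfrog is a harmful invasive threat",
    "nutria looks beautiful but harmful",
    "bluegill caught near the dam",
]
scores = [sentiment_score(post, lexicon) for post in posts]
print(scores)
```

Averaging such scores per species and per year would give a time series comparable to the sentiment trend discussed in the abstract.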

Multi-modal Image Processing for Improving Recognition Accuracy of Text Data in Images (이미지 내의 텍스트 데이터 인식 정확도 향상을 위한 멀티 모달 이미지 처리 프로세스)

  • Park, Jungeun;Joo, Gyeongdon;Kim, Chulyun
    • Database Research / v.34 no.3 / pp.148-158 / 2018
  • Optical character recognition (OCR) is a technique for extracting and recognizing text from images. It is an important preprocessing step in data analysis since much real-world text information is embedded in images. Many OCR engines have high recognition accuracy for images where text is clearly separable from the background, such as black lettering on a white background. However, they have low recognition accuracy for images where text is not easily separable from a complex background. To address this low accuracy on complex images, the input image must be transformed to make the text more noticeable. In this paper, we propose a method that segments an input image into text lines so that OCR engines can recognize each line more efficiently, and that determines the final output by comparing the recognition rates of a CLAHE module and a Two-step module, both of which distinguish text from background regions using image processing techniques. Through thorough experiments comparing against the well-known OCR engines Tesseract and Abbyy, we show that our proposed method has the best recognition accuracy on images with complex backgrounds.

Analysis of the Status of Natural Language Processing Technology Based on Deep Learning (딥러닝 중심의 자연어 처리 기술 현황 분석)

  • Park, Sang-Un
    • The Journal of Bigdata / v.6 no.1 / pp.63-81 / 2021
  • The performance of natural language processing is rapidly improving due to the recent development and application of machine learning and deep learning technologies, and as a result, its field of application is expanding. In particular, as the demand for analysis of unstructured text data increases, interest in NLP (Natural Language Processing) is also increasing. However, due to the complexity and difficulty of natural language preprocessing and of machine learning and deep learning theories, there are still high barriers to the use of natural language processing. In this paper, to give an overall understanding of NLP, we examine the main fields of NLP that are currently being actively researched and the current state of the major technologies centered on machine learning and deep learning, with the aim of providing a foundation for understanding and utilizing NLP more easily. To that end, we investigated how the position of NLP within AI (artificial intelligence) has changed by tracing changes in the taxonomy of AI technology. The main areas of NLP, which consist of language modeling, text classification, text generation, document summarization, question answering, and machine translation, are explained together with state-of-the-art deep learning models. In addition, the major deep learning models utilized in NLP are explained, and the data sets and evaluation measures used for performance evaluation are summarized. We hope that researchers who want to utilize NLP for various purposes in their own fields will be able to understand the overall technical status and the main technologies of NLP through this paper.