• Title/Summary/Keyword: Text mining analysis

Multi-Vector Document Embedding Using Semantic Decomposition of Complex Documents (복합 문서의 의미적 분해를 통한 다중 벡터 문서 임베딩 방법론)

  • Park, Jongin;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.25 no.3 / pp.19-41 / 2019
  • With demand for text data analysis increasing rapidly, research and investment in text mining are being actively pursued not only in academia but also in various industries. Text mining is generally conducted in two steps. In the first step, the text of the collected documents is tokenized and structured to convert the original documents into a computer-readable form. In the second step, tasks such as document classification, clustering, and topic modeling are conducted according to the purpose of the analysis. Until recently, text mining studies have focused on applications of the second step. However, with the discovery that the text structuring process substantially influences the quality of the analysis results, various embedding methods have been actively studied to improve that quality by preserving the meaning of words and documents when text data is represented as vectors. Unlike structured data, which can be directly applied to a variety of operations and traditional analysis techniques, unstructured text must first undergo a structuring task that transforms the original document into a form the computer can understand. Mapping arbitrary objects into a space of a specific dimension while maintaining their algebraic properties, in order to structure text data, is called "embedding". Recently, attempts have been made to embed not only words but also sentences, paragraphs, and entire documents. In particular, as demand for document embedding has increased rapidly, many algorithms have been developed to support it. Among them, doc2Vec, which extends word2Vec and embeds each document into one vector, is the most widely used. However, traditional document embedding methods, represented by doc2Vec, generate a vector for each document using all the words included in the document. As a result, the document vector is affected not only by core words but also by miscellaneous words. Additionally, traditional document embedding schemes usually map each document to a single vector. It is therefore difficult to accurately represent a complex document with multiple subjects as a single vector using the traditional approach. In this paper, we propose a new multi-vector document embedding method to overcome these limitations. This study targets documents that explicitly separate body content and keywords. For a document without keywords, the method can be applied after extracting keywords through various analysis methods. However, since keyword extraction is not the core subject of the proposed method, we describe the process of applying the method to documents with predefined keywords. The proposed method consists of (1) Parsing, (2) Word Embedding, (3) Keyword Vector Extraction, (4) Keyword Clustering, and (5) Multiple-Vector Generation. The specific process is as follows. First, all text in a document is tokenized and each token is represented as an N-dimensional real-valued vector through word embedding. Then, to overcome the limitation that traditional document embedding is affected by miscellaneous words as well as core words, the vectors corresponding to the keywords of each document are extracted to form a set of keyword vectors for that document. Next, clustering is conducted on the set of keyword vectors of each document to identify the multiple subjects included in the document. Finally, a multi-vector is generated from the vectors of the keywords constituting each cluster. Experiments on 3,147 academic papers revealed that the traditional single-vector approach cannot properly map complex documents because of interference among subjects within each vector. With the proposed multi-vector method, we confirmed that complex documents can be vectorized more accurately by eliminating this interference.
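
Steps (3) to (5) of the method described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the clustering algorithm or say how a cluster's keyword vectors are combined, so plain k-means and a component-wise mean are assumed here, and the function names (`kmeans`, `multi_vector`) and the toy two-dimensional vectors are hypothetical.

```python
from math import dist


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def kmeans(points, k, iterations=20):
    """Plain k-means (an assumed choice), initialized with the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (Euclidean distance).
        assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Recompute each centroid; keep the old one if a cluster is empty.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignments


def multi_vector(word_vectors, keywords, k):
    """Extract keyword vectors, cluster them, and average each cluster."""
    kept = [w for w in keywords if w in word_vectors]
    points = [word_vectors[w] for w in kept]
    k = min(k, len(points))  # cannot form more clusters than keywords
    assignments = kmeans(points, k)
    clusters = {}
    for word, cluster_id in zip(kept, assignments):
        clusters.setdefault(cluster_id, []).append(word)
    # One vector per cluster; averaging is an assumption of this sketch.
    return {
        tuple(words): mean_vector([word_vectors[w] for w in words])
        for words in clusters.values()
    }


# Toy 2-dimensional word embeddings (made-up values, for illustration only).
toy_vectors = {
    "embedding": [0.9, 0.1],
    "clustering": [0.1, 0.9],
    "vector": [0.8, 0.2],
    "kmeans": [0.2, 0.8],
    "the": [0.5, 0.5],  # not a keyword, so it never reaches the clustering step
}
doc_vectors = multi_vector(
    toy_vectors, ["embedding", "clustering", "vector", "kmeans"], k=2
)
for words, vector in doc_vectors.items():
    print(words, vector)
```

With these toy values the document ends up with one vector per subject rather than a single vector that blends both subjects, which is the interference the abstract describes.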

Analysis on Research Trend of Productivity Using Text Mining - Focusing on KSCE Journal - (텍스트 마이닝을 통한 건설 생산성 분야의 연구동향 분석 - KSCE 저널을 중심으로 -)

  • Gu, Bongil;Huh, Youngki
    • Korean Journal of Construction Engineering and Management / v.21 no.2 / pp.15-21 / 2020
  • The relationships between keywords found in all productivity-related papers published in the KSCE journal over the last 15 years were analyzed using text mining and the A-Priori algorithm in order to reveal research trends in the area. The results show that the word 'productivity' is most closely related to the words 'work' and 'labor'. Furthermore, it is somewhat related to 'factor', 'model', 'simulation', and 'work time'. On the other hand, the words 'machine' and 'equipment' have little relationship with the keyword. This research will help academia understand research trends in the area of construction productivity.
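
The kind of keyword association reported above can be illustrated with support and confidence, the two measures that A-Priori-style association mining is built on. The sketch below is a simplified, pairs-only version under stated assumptions, not the authors' procedure: the function name `pair_rules`, the threshold, and the keyword lists are hypothetical and do not come from the KSCE data.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support):
    """Return support and confidence for every frequent keyword pair."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in set(t))
    # A-Priori pruning: a pair can only be frequent if both items are frequent.
    frequent = {i for i, c in item_counts.items() if c / n >= min_support}
    pair_counts = Counter()
    for t in transactions:
        items = sorted(set(t) & frequent)
        pair_counts.update(combinations(items, 2))
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            rules[(a, b)] = {
                "support": support,
                "confidence_a_to_b": count / item_counts[a],
                "confidence_b_to_a": count / item_counts[b],
            }
    return rules


# Made-up keyword lists, one per paper (illustration only).
papers = [
    ["productivity", "labor", "work"],
    ["productivity", "labor"],
    ["productivity", "work", "model"],
    ["equipment", "machine"],
]
rules = pair_rules(papers, min_support=0.5)
for pair, measures in rules.items():
    print(pair, measures)
```

In this toy data 'equipment' and 'machine' never co-occur with 'productivity', so no rule links them to it, which mirrors the weak relationship described in the abstract.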

The Research Trends and Keywords Modeling of Shoulder Rehabilitation using the Text-mining Technique (텍스트 마이닝 기법을 활용한 어깨 재활 연구분야 동향과 키워드 모델링)

  • Kim, Jun-hee;Jung, Sung-hoon;Hwang, Ui-jae
    • Journal of the Korean Society of Physical Medicine / v.16 no.2 / pp.91-100 / 2021
  • PURPOSE: This study analyzed the trends and characteristics of shoulder rehabilitation research through keyword analysis and modeled the relationships among the keywords using text mining techniques. METHODS: Abstracts of 10,121 articles registered in MEDLINE on PubMed with 'shoulder' and 'rehabilitation' as keywords were collected using Python. Based on word frequency analysis, the 10 most frequent keywords were selected. Word embedding was performed using the word2vec technique to analyze the similarity of words. In addition, the words were grouped and analyzed based on distance (cosine similarity) using the t-SNE technique. RESULTS: The number of studies related to shoulder rehabilitation is increasing year after year, and the keywords most frequently used in these studies are 'patient', 'pain', and 'treatment'. The word2vec results showed that the words were highly correlated with 12 keywords from studies related to shoulder rehabilitation. Furthermore, through t-SNE, the keywords of the studies were divided into 5 groups. CONCLUSION: This was the first study to use text mining techniques to model the keywords, and their relationships, that make up the abstracts of research in MEDLINE on PubMed related to 'shoulder' and 'rehabilitation'. The results will help diversify the research topics of future shoulder rehabilitation studies.

Sentiment Analysis and Network Analysis based on Review Text (리뷰 텍스트 기반 감성 분석과 네트워크 분석에 관한 연구)

  • Kim, Yumi;Heo, Go Eun
    • Journal of the Korean Society for Library and Information Science / v.55 no.3 / pp.397-417 / 2021
  • Because review text contains the experiences and opinions of customers, analyzing it helps to understand the subject being reviewed. Existing studies of online restaurant reviews used either sentiment analysis alone, to identify customers' assessments of different features of a restaurant, or network analysis alone, to determine customers' preferences. In this study, we conducted both sentiment analysis and network analysis on the review text of restaurants with high star ratings and those with low star ratings. We compared the review text of the two groups to identify the differences between them and what makes great restaurants great.

An Exploratory Study of Happiness and Unhappiness Among Koreans based on Text Mining Techniques (텍스트마이닝 기법을 활용한 한국인의 행복과 불행 탐색연구)

  • Park, Sanghyeon;Do, Kanghyuk;Kim, Hakyeong;Park, Gaeun;Yun, Jinhyeok;Kim, Kyungil
    • The Journal of the Korea Contents Association / v.18 no.7 / pp.10-27 / 2018
  • The purpose of this study is to explore the meaning of happiness and unhappiness in Korean society through text mining analysis. Words similar to the keywords (happiness/unhappiness) are extracted from an online news portal using the Word2Vec and TF-IDF methods. We also use the K-LIWC dictionary to perform sentiment analysis of the words associated with happiness and unhappiness. In the TF-IDF analysis, happiness and unhappiness are highly related to social factors and the social issues of each year. In the Word2Vec analysis, 'hope' remained similar to happiness over six years. In the K-LIWC analysis, 'money/financial issues', 'school', and 'communication' are highly related to both happiness and unhappiness. In addition, 'physical condition and symptom' is highly related to unhappiness. Implications, limitations, and suggestions for future research are also discussed.

Analyzing Architectural History Terminologies by Text Mining and Association Analysis (텍스트 마이닝과 연관 관계 분석을 이용한 건축역사 용어 분석)

  • Kim, Min-Jeong;Kim, Chul-Joo
    • Journal of Digital Convergence / v.15 no.1 / pp.443-452 / 2017
  • Architectural history traces the changes in architecture through various traditions, regions, overarching stylistic trends, and dates. This study used text mining and association analysis to identify terminologies in the field of architectural history in terms of their proximity and frequency. The terminologies were drawn from articles published in the "Journal of Architectural History", the sole journal for architectural history studies. First, key terminologies that appeared frequently were extracted from papers that had titles, keywords, and abstracts. Then, we analyzed typical key terminologies that appear frequently across the field and specific ones that appear only in particular research areas. Finally, association analysis was used to find frequent patterns among the key terminologies. This research can be used as fundamental data for understanding issues and trends in architectural history.

An Analysis Scheme Design of Customer Spending Pattern using Text Mining (텍스트 마이닝을 이용한 소비자 소비패턴 분석 기법 설계)

  • Jeong, Eun-Hee;Lee, Byung-Kwan
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.11 no.2 / pp.181-188 / 2018
  • In this paper, we propose a scheme for analyzing customer spending patterns using text mining. The proposed scheme first analyzes the similarity of users' ratings using Pearson correlation, then analyzes the similarity of users' reviews using TF-IDF cosine similarity, and finally analyzes the consistency between ratings and reviews using SentiWordNet. It then selects the nearest neighbors using the rating similarity and review similarity, and provides a recommendation list suited to the consumption pattern. The precision of the recommendation list is 0.79 for Pearson correlation, 0.73 for TF-IDF, and 0.82 for the proposed consumption pattern scheme. That is, the proposed scheme can analyze consumption patterns more accurately because it uses both the quantitative ratings and the qualitative reviews of consumers.
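
The two similarity measures named above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the TF-IDF weighting variant, the tokenization, or how the two similarities are combined, so relative term frequency with a natural-log inverse document frequency and whitespace tokenization are assumed, and all ratings and review texts are made up.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation of two equal-length rating lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for constant ratings; treated as no correlation
    return cov / math.sqrt(var_x * var_y)


def tfidf_vectors(documents):
    """Sparse TF-IDF vectors (term -> weight), one dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up ratings that two users gave to the same four items.
rating_similarity = pearson([5, 4, 1, 2], [4, 5, 2, 1])

# Made-up review texts from three users.
reviews = [
    "great pasta friendly staff",
    "great pasta slow service",
    "noisy room cold food",
]
vectors = tfidf_vectors(reviews)
review_sim_close = cosine(vectors[0], vectors[1])  # reviews sharing two terms
review_sim_far = cosine(vectors[0], vectors[2])    # reviews sharing no terms

print(rating_similarity, review_sim_close, review_sim_far)
```

Either score could then be used to rank candidate neighbors; how the scheme weighs the two against each other is not stated in the abstract, so that step is left out.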

A Study on the Consumer's Perception of HiSeoul Fashion Show Using Big Data Analysis (빅데이터 분석을 활용한 하이서울패션쇼에 대한 소비자 인식 조사)

  • Han, Ki Hyang
    • Journal of Fashion Business / v.23 no.5 / pp.81-95 / 2019
  • The purpose of this study is to investigate consumers' perception of the HiSeoul fashion show, which new designers use as a means of promotion, and to propose a strategy for revitalizing new designer brands. The aim was to secure basic data from fashion consumers to help guide marketing strategies and promote rising designers. Consumers' perception of the HiSeoul fashion show was examined through text mining, data refinement, and word clouds produced with TEXTOM3.0. In addition, semantic network analysis, CONCOR analysis, and visualization of the results were performed using Ucinet 6.0 and NetDraw. "HiSeoul fashion show" was used as the keyword for text mining, and data was collected from March 1, 2018 to April 30, 2019. Frequency analysis, TF-IDF, and N-gram analysis showed that consumers are aware of the places where shows are held, such as DDP and Igansumun. They also revealed that consumers recognize rising designer brands, designers' names, the names of guests attending the show, and the photo times. This study is meaningful in that it not only confirmed, through big data, consumers' interest in new designer brands participating in the HiSeoul Fashion Show, but also confirmed that the show can serve as a marketing strategy to boost brand sales. As marketing strategies, this study suggests using the HiSeoul show room to induce consumer purchases, or inviting guests who match the brand image in order to promote them on SNS on the day the show is held.

Regional Image Change Analysis using Text Mining and Network Analysis (텍스트 마이닝과 네트워크 분석을 이용한 지역 이미지 변화 분석)

  • Jeong, Eun-Hee
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.15 no.2 / pp.79-88 / 2022
  • Social media big data includes a great deal of information that can reveal not only consumer consumption patterns but also regional images. This paper collected data containing 'Samcheok' annually from 2015 to 2019 from the blogs and cafes of Naver and Daum, which are domestic portal sites, refined the keywords that form the regional image by performing text mining and network analysis, and analyzed how the regional image changed. According to the results, the regional image in 2015 was expressed through image cognitive elements such as nearby place names or places, for example 'Jangho Port', 'Donghae', and 'Beach'. However, the regional images in both 2016 and 2019 changed to image cognitive elements centered on 'SamcheokSolbich', a specific place within the region. Therefore, as the keywords related to the regional image include 'Jangho Port' and 'Resort', which are representative attractions of Samcheok, it can be seen that infrastructure plays a large role in forming the regional image. The significance test for the network data used the bootstrap technique; the p-values in 2015, 2016, and 2019 were 0.0002, 0.0006, and 0.0002, respectively, which are statistically significant at the 5% significance level.

Topic Analysis of the "Right to be Forgotten" Using Text Mining (텍스트마이닝을 활용한 "잊힐 권리"의 토픽 분석)

  • Lee, So-Hyun;Koo, Bon-Jin
    • Journal of the Korean Society for Information Management / v.39 no.2 / pp.275-298 / 2022
  • This study examined the issues and characteristics that appeared in news and journal articles related to the 'right to be forgotten' using text mining analysis. Data for analysis were collected from 2010 to 2020 with the keyword 'right to be forgotten'. Keyword analysis and topic modeling were performed on the collected data. The results show that, over the last 10 years, the issues surrounding the 'right to be forgotten' did not differ much between news and journal articles, and the approaches were also similar. However, the comparison confirmed both common issues and partial differences between news and journal articles. Therefore, Archives and Records Management Studies needs to discuss the issues derived in this study. In particular, common issues should be considered first, but where the issues differ, they need to be discussed from various perspectives. This study is meaningful in that it clarifies the meaning of the 'right to be forgotten' and identifies issues that may arise in the future. The results will contribute to wide-ranging discussion of the 'right to be forgotten' in Archives and Records Management Studies.