• Title/Summary/Keyword: Cosine Similarity Analysis

A Social Network Analysis of Research Topics in Korean Nursing Science (한국 간호학 연구주제의 사회 연결망 분석)

  • Lee, Soo-Kyoung;Jeong, Senator;Kim, Hong-Gee;Yom, Young-Hee
    • Journal of Korean Academy of Nursing / v.41 no.5 / pp.623-632 / 2011
  • Purpose: This study was done to explore the knowledge structure of Korean Nursing Science. Methods: The main variables were key words from the research papers that were presented in the Journal of Korean Academy of Nursing and journals of the seven branches of the Korean Academy of Nursing. English titles and abstracts of the papers (n=5,936) published from 1995 through 2009 were included. Noun phrases were extracted from the corpora using an in-house program (BiKE Text Analyzer), and their co-occurrence networks were generated via a cosine similarity measure, and then the networks were analyzed and visualized using Pajek, a Social Network Analysis program. Results: With the hub and authority measures, the most important research topics in Korean Nursing Science were identified. Newly emerging topics by three-year period units were observed as research trends. Conclusion: This study provides a systematic overview on the knowledge structure of Korean Nursing Science. The Social Network Analysis for this study will be useful for identifying the knowledge structure in Nursing Science.
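The co-occurrence step described in this abstract can be illustrated with a minimal sketch. It assumes each noun phrase is a binary vector over documents, so the cosine between two phrases reduces to their co-occurrence count divided by the square root of the product of their document frequencies. The function name, the threshold, and the toy corpus are hypothetical; this is not the BiKE Text Analyzer or Pajek workflow used in the study.

```python
import math
from itertools import combinations

def cosine_cooccurrence_edges(docs, threshold=0.5):
    """Return {(term_a, term_b): cosine} for term pairs at or above threshold.

    docs is a list of sets of terms. Each term is treated as a binary vector
    over documents, so cosine = co-occurrences / sqrt(df_a * df_b).
    """
    doc_freq = {}
    pair_freq = {}
    for terms in docs:
        for term in terms:
            doc_freq[term] = doc_freq.get(term, 0) + 1
        # Sort so that each unordered pair always maps to the same key.
        for a, b in combinations(sorted(terms), 2):
            pair_freq[(a, b)] = pair_freq.get((a, b), 0) + 1
    edges = {}
    for (a, b), n_ab in pair_freq.items():
        sim = n_ab / math.sqrt(doc_freq[a] * doc_freq[b])
        if sim >= threshold:
            edges[(a, b)] = sim
    return edges

# Hypothetical toy corpus, not data from the study.
docs = [
    {"quality of life", "depression", "elderly"},
    {"quality of life", "depression"},
    {"elderly", "self-efficacy"},
]
edges = cosine_cooccurrence_edges(docs, threshold=0.6)
```

The resulting edge dictionary could then be exported to a network analysis tool; the threshold controls how dense the network becomes.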

An Exploratory Study of Collective E-Petitions Estimation Methodology Using Anomaly Detection: Focusing on the Voice of Citizens of Changwon City (이상탐지 활용 전자집단민원 추정 방법론에 관한 탐색적 연구: 창원시 시민의 소리 사례를 중심으로)

  • Jeong, Ha-Yeong
    • Informatization Policy / v.26 no.4 / pp.85-106 / 2019
  • Recently, collective petitions have increasingly been filed through the electronic petition system. However, there is no efficient management system, raising concerns about side effects such as increased administrative workload and the proliferation of social conflicts. To propose a methodology for estimating electronic collective petitions using anomaly detection and corpus linguistics-based content analysis, this study conducted the following: i) a theoretical review of the concept of collective petitions, ii) estimation of electronic collective petitions using anomaly detection based on nonparametric unsupervised learning, iii) a content similarity analysis of petitions using n-gram cosine angle distance, and iv) a case study on the Voice of Citizens of Changwon City, through which the utility of the proposed methodology, policy implications, and future tasks were reviewed.
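As a rough illustration of step iii), the sketch below compares two texts by the cosine of their character n-gram count vectors. It assumes character bigrams and defines the distance as 1 minus the cosine similarity; the abstract does not state the n-gram unit or the exact distance formula, so both are assumptions, and the petition texts are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    """Count character n-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def cosine_distance(text_a, text_b, n=2):
    """Assumed definition: 1 - cosine similarity of the n-gram profiles."""
    return 1.0 - cosine_similarity(char_ngrams(text_a, n), char_ngrams(text_b, n))

# Hypothetical petition texts, not data from the study.
p1 = "Please repair the road near the school"
p2 = "please repair the road near the school now"
p3 = "Noise complaint about construction at night"
near = cosine_distance(p1, p2)
far = cosine_distance(p1, p3)
```

Petitions that were copied with small edits should yield a distance close to 0, which is what makes this measure a candidate signal for collective filing.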

Patent Keyword Analysis for Forecasting Emerging Technology : GHG Technology (부상기술 예측을 위한 특허키워드정보분석에 관한 연구 - GHG 기술 중심으로)

  • Choe, Do Han;Kim, Gab Jo;Park, Sang Sung;Jang, Dong Sik
    • Journal of Korea Society of Digital Industry and Information Management / v.9 no.2 / pp.139-149 / 2013
  • As technology forecasting becomes more important to countries and companies managing R&D projects, forecasting methodologies have diversified. One such method is patent analysis. This research proposes a quick process for forecasting emerging technology based on a keyword approach using text mining. The forecasting process is as follows. First, a term-document matrix is extracted from patent documents using text mining. Second, emerging technology keywords are extracted by analyzing the importance of each word, using the mean and standard deviation of the term, and the emerging trend of each word, discovered from the term's time series information. Next, the association between terms is measured using cosine similarity. Finally, the emerging technology keywords are selected by synthesizing these results, and the emerging technology is forecast accordingly. The technology forecasting process described in this paper can be applied to developing a computerized technology forecasting system that integrates the results of other patent analyses for decision makers in companies and governments.

Comparative Analysis of Vectorization Techniques in Electronic Medical Records Classification (의무 기록 문서 분류를 위한 자연어 처리에서 최적의 벡터화 방법에 대한 비교 분석)

  • Yoo, Sung Lim
    • Journal of Biomedical Engineering Research / v.43 no.2 / pp.109-115 / 2022
  • Purpose: Medical records classification using vectorization techniques plays an important role in natural language processing. The purpose of this study was to investigate proper vectorization techniques for electronic medical records classification. Material and methods: 403 electronic medical documents were extracted retrospectively and classified using the cosine similarity calculated by Scikit-learn (a Python module for machine learning) in Jupyter Notebook. Vectors for the medical documents were produced by three different vectorization techniques (TF-IDF, latent semantic analysis, and Word2Vec), and the classification precision of each technique was evaluated. The Kruskal-Wallis test was used to determine whether there was a significant difference among the three vectorization techniques. Results: The 403 medical documents were relevant to 41 different diseases, and the average number of documents per diagnosis was 9.83 (standard deviation=3.46). The classification precisions for the three vectorization techniques were 0.78 (TF-IDF), 0.87 (LSA), and 0.79 (Word2Vec). There was a statistically significant difference among the three vectorization techniques. Conclusions: The results suggest that removing irrelevant information (LSA) is a more efficient vectorization technique than modifying the weights of vectorization models (TF-IDF, Word2Vec) for medical document classification.
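The TF-IDF variant of this classification idea can be sketched without external dependencies. The study itself used Scikit-learn; the version below is a stand-in that assumes a smoothed IDF weighting and a nearest-neighbor rule (assign the label of the most similar training document), neither of which is confirmed by the abstract. All function names and the toy records are hypothetical.

```python
import math
from collections import Counter

def fit_idf(token_lists):
    """Smoothed inverse document frequency for every term in the training set."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}

def tfidf(tokens, idf):
    """Sparse TF-IDF vector (dict) for one document; unseen terms are skipped."""
    return {term: tf * idf[term]
            for term, tf in Counter(tokens).items() if term in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify(tokens, train_docs, train_labels, idf):
    """Return the label of the training document most similar to the query."""
    query = tfidf(tokens, idf)
    scores = [cosine(query, tfidf(doc, idf)) for doc in train_docs]
    return train_labels[scores.index(max(scores))]

# Hypothetical toy records, not data from the study.
train_docs = [
    "chest pain shortness of breath".split(),
    "cough fever sputum".split(),
    "chest pain radiating to arm".split(),
]
train_labels = ["cardiac", "respiratory", "cardiac"]
idf = fit_idf(train_docs)
label = classify("fever and productive cough".split(), train_docs, train_labels, idf)
```

Swapping the `tfidf` step for an LSA or Word2Vec representation while keeping the same cosine comparison is the kind of substitution the study evaluates.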

Research on Business Job Specification through Employment Information Analysis (채용정보 분석을 통한 비즈니스 직무 스펙 연구)

  • Lee, Jong Hwa;Lee, Hyun Kyu
    • The Journal of Information Systems / v.31 no.1 / pp.271-287 / 2022
  • Purpose: This research aims to study the changes in recruitment needed for the growth and survival of companies in a rapidly changing industry. In particular, we built a real company's worklist that accounts for the rapidly advancing data-driven digital transformation, and presented the capabilities and conditions required for the work. Design/methodology/approach: We selected 37 jobs based on NCS to develop the employment search requirements by analyzing the business characteristics and work capabilities of the industry and company. The business specification indicators were converted into a matrix through the TF-IDF process, and the NMF algorithm was used to extract the features of each document. The cosine distance measure was then used to determine the similarity of the job specification conditions. Findings: Companies tended to prefer "IT competency," a specification related to computer use and certification, and "experience competency," a specification for experience and internships. In addition, "foreign language competency" was also preferred depending on the job. This analysis and development of job requirements would not only help companies find talent but also help jobseekers decide the priority of their specification activities.

A Study on Market Size Estimation Method by Product Group Using Word2Vec Algorithm (Word2Vec을 활용한 제품군별 시장규모 추정 방법에 관한 연구)

  • Jung, Ye Lim;Kim, Ji Hui;Yoo, Hyoung Sun
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.1-21 / 2020
  • With the rapid development of artificial intelligence technology, various techniques have been developed to extract meaningful information from unstructured text data, which constitutes a large portion of big data. Over the past decades, text mining technologies have been utilized in various industries for practical applications. In the field of business intelligence, text mining has been employed to discover new market and/or technology opportunities and to support rational decision making by business participants. Market information such as market size, market growth rate, and market share is essential for setting companies' business strategies. There has been a continuous demand in various fields for market information at the level of specific products. However, such information has generally been provided at the industry level or in broad categories based on classification standards, making it difficult to obtain specific and proper information. In this regard, we propose a new methodology that can estimate the market sizes of product groups at more detailed levels than previously offered. We applied the Word2Vec algorithm, a neural network-based semantic word embedding model, to enable automatic market size estimation from individual companies' product information in a bottom-up manner. The overall process is as follows: First, the data related to product information is collected, refined, and restructured into a form suitable for applying the Word2Vec model. Next, the preprocessed data is embedded into a vector space by Word2Vec, and the product groups are then derived by extracting similar product names based on cosine similarity calculation. Finally, the sales data on the extracted products is summed to estimate the market size of the product groups. As experimental data, product names from Statistics Korea's microdata (345,103 cases) were mapped into a multidimensional vector space by Word2Vec training. We performed parameter optimization for training and then applied a vector dimension of 300 and a window size of 15 as the optimized parameters for further experiments. We employed index words of the Korean Standard Industry Classification (KSIC) as a product name dataset to cluster product groups more efficiently. The product names similar to KSIC index words were extracted based on cosine similarity. The market size of the extracted products, treated as one product category, was calculated from individual companies' sales data. The market sizes of 11,654 specific product lines were automatically estimated by the proposed model. For performance verification, the results were compared with the actual market sizes of some items. The Pearson's correlation coefficient was 0.513. Our approach has several advantages over previous studies. First, text mining and machine learning techniques were applied for the first time to market size estimation, overcoming the limitations of traditional methods that rely on sampling or require multiple assumptions. In addition, the level of market category can be easily and efficiently adjusted according to the purpose of information use by changing the cosine similarity threshold. Furthermore, the approach has high potential for practical application since it can resolve unmet needs for detailed market size information in the public and private sectors. Specifically, it can be utilized in technology evaluation and technology commercialization support programs conducted by governmental institutions, as well as in business strategy consulting and market analysis report publishing by private firms. The limitation of our study is that the presented model needs to be improved in terms of accuracy and reliability. The semantic-based word embedding module can be advanced by giving a proper order to the preprocessed dataset or by combining another algorithm such as Jaccard similarity with Word2Vec. Also, the method of product group clustering can be changed to other types of unsupervised machine learning algorithms. Our group is currently working on subsequent studies, and we expect that they can further improve the performance of the basic model conceptually proposed in this study.
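The final two steps (extracting product names similar to an index word, then summing their sales) might look like the sketch below. It assumes the embeddings already exist; the three-dimensional vectors, product names, sales figures, and threshold values are all hypothetical stand-ins for the 300-dimensional Word2Vec vectors and microdata used in the study.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def estimate_market_size(index_vec, product_vecs, sales, threshold=0.9):
    """Group products whose similarity to the index word meets the threshold
    and sum their sales as the estimated market size of that group."""
    members = [name for name, vec in product_vecs.items()
               if cosine(index_vec, vec) >= threshold]
    return members, sum(sales[name] for name in members)

# Hypothetical embeddings and sales figures, for illustration only.
index_vec = [1.0, 0.0, 0.0]
product_vecs = {
    "led lamp": [0.9, 0.1, 0.0],
    "led bulb": [0.8, 0.3, 0.1],
    "steel pipe": [0.0, 1.0, 0.2],
}
sales = {"led lamp": 120, "led bulb": 80, "steel pipe": 300}

broad_group, broad_total = estimate_market_size(index_vec, product_vecs, sales, 0.9)
narrow_group, narrow_total = estimate_market_size(index_vec, product_vecs, sales, 0.95)
```

Raising the threshold narrows the product group and lowers the estimate, which mirrors the abstract's point that the market category level can be adjusted through the cosine similarity threshold.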

The Research Trends and Keywords Modeling of Shoulder Rehabilitation using the Text-mining Technique (텍스트 마이닝 기법을 활용한 어깨 재활 연구분야 동향과 키워드 모델링)

  • Kim, Jun-hee;Jung, Sung-hoon;Hwang, Ui-jae
    • Journal of the Korean Society of Physical Medicine / v.16 no.2 / pp.91-100 / 2021
  • PURPOSE: This study analyzed the trends and characteristics of shoulder rehabilitation research through keyword analysis, and modeled the relationships among keywords using text mining techniques. METHODS: Abstracts of 10,121 articles registered in the MEDLINE database of PubMed with 'shoulder' and 'rehabilitation' as keywords were collected using Python. By analyzing word frequency, 10 keywords were selected in descending order of frequency. Word embedding was performed using the word2vec technique to analyze the similarity of words. In addition, the words were grouped and analyzed based on distance (cosine similarity) using the t-SNE technique. RESULTS: The number of studies related to shoulder rehabilitation is increasing year after year, and the keywords most frequently used in these studies are 'patient', 'pain', and 'treatment'. The word2vec results showed that the words were highly correlated with 12 keywords from studies related to shoulder rehabilitation. Furthermore, through t-SNE, the keywords of the studies were divided into 5 groups. CONCLUSION: This was the first study to use text mining techniques to model the keywords, and the relationships among them, that make up the abstracts of research in the MEDLINE database of PubMed related to 'shoulder' and 'rehabilitation'. The results will help diversify the research topics of future shoulder rehabilitation studies.

Exploration of Hierarchical Techniques for Clustering Korean Author Names (한글 저자명 군집화를 위한 계층적 기법 비교)

  • Kang, In-Su
    • Journal of Information Management / v.40 no.2 / pp.95-115 / 2009
  • Author resolution disambiguates same-name author occurrences into real individuals. For this, pair-wise author similarities are computed for author name entities, and then clustering is performed. So far, many studies have employed hierarchical clustering techniques for author disambiguation. However, the various hierarchical clustering methods have not been sufficiently investigated. This study presents an empirical evaluation and analysis of hierarchical clustering applied to Korean author resolution, using multiple distance functions such as the Dice coefficient, cosine similarity, Euclidean distance, the Jaccard coefficient, and the Pearson correlation coefficient.
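Three of the listed measures have simple set-based forms, sketched below for a pair of same-name author records. The sketch assumes each record is described by a set of features such as coauthor names; the feature choice and the example records are hypothetical, and Euclidean distance and Pearson correlation, which operate on numeric vectors, are left out.

```python
import math

def dice(a, b):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def jaccard(a, b):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a or b else 0.0

def set_cosine(a, b):
    """Cosine similarity for binary vectors: |A ∩ B| / sqrt(|A| * |B|)."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

# Hypothetical coauthor sets for two records that share an author name.
record_1 = {"kim", "lee", "park"}
record_2 = {"kim", "lee", "choi"}

scores = {
    "dice": dice(record_1, record_2),
    "jaccard": jaccard(record_1, record_2),
    "cosine": set_cosine(record_1, record_2),
}
```

Because the measures rank the same pair differently (Jaccard is the strictest here), the choice of function can change which records a hierarchical clustering merges first.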

Redundant and Abnormal Data Processing Scheme in Large-scale IoT Environment (대규모 IoT 환경에서의 중복 및 비정상 데이터 처리 기법)

  • Kim, Min-Woo;Lee, Tae-Ho;Lee, Byung-Jun;Kim, Kyung-Tae;Youn, Hee-Yong
    • Proceedings of the Korean Society of Computer Information Conference / 2019.07a / pp.109-110 / 2019
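  • In recent IoT environments, nodes are densely deployed. These sensor nodes generate redundant data that causes congestion during transmission and degrades data accuracy. To resolve the network congestion caused by this data concentration, the proposed scheme measures outliers and redundancy in the data using interquartile range (IQR) analysis and a cosine similarity function, and removes redundant data and outliers. With this approach, optimized data transmission can improve IoT communication performance and, as a result, increase the data reduction rate, network lifetime, and energy efficiency.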

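A minimal sketch of the two filters named in the IoT abstract above follows. It assumes the conventional 1.5 × IQR fences for outliers and a cosine threshold of 0.999 for redundancy; the abstract gives neither value, and the function names and sensor readings are hypothetical.

```python
import math
import statistics

def remove_outliers_iqr(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed default."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def drop_redundant(windows, threshold=0.999):
    """Keep a reading window only if it is not near-parallel to one already kept."""
    kept = []
    for window in windows:
        if all(cosine(window, other) < threshold for other in kept):
            kept.append(window)
    return kept

# Hypothetical sensor data, for illustration only.
readings = [20.1, 20.3, 19.9, 20.0, 20.2, 35.0, 20.1]
cleaned = remove_outliers_iqr(readings)

windows = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 1.0, 0.5]]
unique_windows = drop_redundant(windows)
```

Note that cosine similarity ignores magnitude, so the second window, a scaled copy of the first, is treated as redundant; whether that is desirable depends on the sensor data being compared.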

A Bibliographic Study on the Calvin Theological Journal (칼빈 신학교 학술지에 대한 계량서지학적 분석에 관한 연구)

  • Yoo, Yeong Jun;Lee, Jae Yun
    • Journal of the Korean BIBLIA Society for library and Information Science / v.27 no.4 / pp.125-145 / 2016
  • This study aimed to identify the theological trends of the Calvin Theological Journal by analyzing Library of Congress Subject Headings (LCSH). The study performed a time-series analysis and an analysis of distinctive terms by examining the main authors and the subject headings of the articles published in the Calvin Theological Journal over 45 years. We also proposed a new method of dividing the analysis period according to changes in authors and subject headings. In the results, the 18 main authors formed three clusters and shared the subjects of Calvin, Reformed theology, and the Bible. Reformed characteristics appeared in the first and second periods, but Reformed theology was at the margins. In the third period, Calvin appeared less frequently and Reformed theology more frequently than before, although it remained at the periphery. Literary criticism was clustered independently. Many Reformed theology terms appeared among the distinctive terms in all three periods, and in the 2-1 period in particular, science and religion were included as distinctive terms. Therefore, the theological tendency of the Calvin Theological Journal appeared to be Reformed theology and the Old Testament.