• Title/Summary/Keyword: Document Frequency

A Text Similarity Measurement Method Based on Singular Value Decomposition and Semantic Relevance

  • Li, Xu;Yao, Chunlong;Fan, Fenglong;Yu, Xiaoqiang
    • Journal of Information Processing Systems, v.13 no.4, pp.863-875, 2017
  • Traditional text similarity measures based on word-frequency vectors ignore the semantic relationships between words. This, together with the high dimensionality and sparsity of document vectors, has become an obstacle to text similarity calculation. To address these problems, an improved singular value decomposition is used to reduce the dimensionality and remove noise from the text representation model. The optimal number of singular values is analyzed, and the semantic relevance between words is calculated in the constructed semantic space. An inverted index construction algorithm and similarity definitions between vectors are proposed to calculate the similarity between two documents at the semantic level. Experimental results on a benchmark corpus demonstrate that the proposed method improves the F-measure.

DOCST: Document frequency Oriented Clustering for Short Texts (가중치를 이용한 효과적인 항공 단문 군집 방법)

  • Kim, Jooyoung;Lee, Jimin;An, Soonhong;Lee, Hoonsuk
    • Proceedings of the Korea Information Processing Society Conference, 2018.05a, pp.331-334, 2018
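  • Machine learning on text data, one of the representative forms of unstructured data, is used across a variety of industries. NOTAMs are aviation messages generated by the thousands every day, and they are currently analyzed manually. While machine learning promises gains in work efficiency, general-purpose analysis is difficult because the data consist of short texts mixed with abbreviations. In this study, we propose a method for clustering documents when the dataset is not large, abbreviations are mixed in, and the sentences are very short. By using LDA, which classifies documents by topic, and Word2Vec, which represents words in a k-dimensional vector space, documents can be clustered efficiently even in noisy short-text data.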

Keyword identifications on dimensions for service quality of Healthcare providers (헬스케어 서비스 리뷰를 활용한 서비스 품질 차원 별 중요 단어 파악 방안)

  • Lee, Hong Joo
    • Knowledge Management Research, v.19 no.4, pp.171-185, 2018
  • Studies on online reviews have analyzed ratings and topics as a whole. However, it is also necessary to analyze opinions on the individual dimensions of service quality. This study classifies reviews of healthcare services into service quality dimensions and proposes a method to identify the words mainly mentioned in each dimension. Service quality was based on the dimensions provided by SERVQUAL, and patient reviews were collected from NHSChoice. The 2,000 sampled sentences were classified into the SERVQUAL service quality dimensions, and a method of extracting important keywords from sentences by service quality dimension was suggested. The RAKE algorithm is used to extract keywords from a single document, and an additional index is used to account for words used frequently across documents. Since we need to identify keywords across many reviews, we considered frequency and discrimination (IDF) at the same time, rather than identifying keywords based only on the RAKE score. For each SERVQUAL dimension, we identified the words that patients mainly mentioned, and we also identified the words that patients mainly mentioned by review rating.
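
As a rough illustration of weighting keyword frequency by discrimination (IDF), the following sketch ranks terms by total frequency multiplied by inverse document frequency. It does not implement RAKE or the index used in the paper; the function names `idf_scores` and `rank_keywords`, the sample tokens, and the IDF variant `log(N / df) + 1` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def idf_scores(documents):
    """Return an inverse document frequency value for each term.

    documents: list of token lists. Uses log(N / df) + 1, which is one
    common IDF variant (an assumption, not the paper's definition).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def rank_keywords(documents):
    """Rank terms by total term frequency multiplied by IDF (illustrative score)."""
    idf = idf_scores(documents)
    term_freq = Counter(token for tokens in documents for token in tokens)
    scores = {term: term_freq[term] * idf[term] for term in term_freq}
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical tokenized review sentences.
reviews = [
    ["staff", "friendly", "staff"],
    ["waiting", "time", "long"],
    ["staff", "waiting", "room"],
]
ranking = rank_keywords(reviews)
```

Here `ranking[0]` is the term with the highest combined score; a term that appears in every document receives the lowest IDF under this variant.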

Research trends in dental hygiene based on topic modeling and semantic network analysis

  • Yun-Jeong Kim;Jae-Hee Roh
    • Journal of Korean society of Dental Hygiene, v.22 no.6, pp.495-502, 2022
  • Objectives: The purpose of this study was to analyze research trends in dental hygiene using topic modeling and semantic network analysis. Methods: A total of 686 keywords were collected from 261 studies published from 2019 to 2021 and retrieved from the Research Information Sharing Service (RISS). Topic modeling and semantic network analysis were performed using Textom. Results: The keywords with the highest frequency and the highest term frequency-inverse document frequency (TF-IDF) were 'dental hygienist', 'oral health', 'elderly', 'periodontal disease', and 'dental hygiene'. N-gram analysis of the keywords showed that 'dental hygienist-emotional labor', 'dental hygienist-elderly', 'dental hygienist-job performance', 'oral health-quality of life', 'oral health-periodontal disease', etc. appeared frequently. Keywords with high degree centrality were 'dental hygienist (0.317)', 'oral health (0.239)', 'elderly (0.127)', 'job satisfaction (0.057)', and 'dental care (0.049)'. Five topics were extracted by topic modeling. Conclusions: The results of this study can be used to understand research trends in dental hygiene, and more detailed and qualitative analysis is needed in follow-up studies.

A Study on the Blockchain based Frequency Allocation Process for Private 5G (블록체인 기반 5G 특화망 주파수 할당 프로세스 연구)

  • Won-Seok Yoo;Won-Cheol Lee
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology, v.16 no.1, pp.24-32, 2023
  • The current procedure for using Private 5G consists of four steps: application, examination, use, and usage inspection. Application and examination take place before frequency allocation, while use and usage inspection take place after it. Various types of documents are required to apply for Private 5G, and because of the document screening process and the radio station inspection required to use Private 5G frequencies, the procedure is complicated for applicants and takes a considerable amount of time. In this paper, we propose a frequency allocation process for Private 5G using a blockchain platform, which is faster and simpler than the current procedure. Through the use of a blockchain platform and NFTs (Non-Fungible Tokens), the reliability and integrity of the data required in the frequency allocation process were secured, the security of frequency usage information was maintained, and a reliable Private 5G frequency allocation process was established. In addition, by applying an RPA system that minimizes human intervention, fairness was secured in the process of allocating Private 5G frequencies. Finally, the Private 5G frequency allocation process based on the Ethereum blockchain was carried out through a simulation.

A Collaborative Filtering System Combined with Users' Review Mining : Application to the Recommendation of Smartphone Apps (사용자 리뷰 마이닝을 결합한 협업 필터링 시스템: 스마트폰 앱 추천에의 응용)

  • Jeon, ByeoungKug;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems, v.21 no.2, pp.1-18, 2015
  • Collaborative filtering (CF) algorithms have been widely used for recommender systems in both academic and practical applications. A general CF system compares users based on how similar they are and creates recommendations from the items favored by other people with similar tastes. Thus, it is very important for CF to measure the similarities between users, because the recommendation quality depends on them. In most cases, only users' explicit numeric ratings of items (i.e., quantitative information) have been used to calculate the similarities between users in CF. However, several studies have indicated that qualitative information such as users' reviews of the items may help measure these similarities more accurately. Considering that many people are likely to share their honest opinions on recently purchased items since the advent of Web 2.0, users' reviews can be regarded as an informative source for accurately identifying users' preferences. Against this background, this study proposes a new hybrid recommender system that incorporates users' review mining. Our proposed system is based on conventional memory-based CF, but it is designed to use both a user's numeric ratings and his/her text reviews of the items when calculating similarities between users. Specifically, our system creates not only a user-item rating matrix but also a user-item review term matrix. Then, it calculates rating similarity and review similarity from each matrix and calculates the final user-to-user similarity based on these two similarities. As methods for calculating review similarity between users, we propose two alternatives: one uses the frequency of the commonly used terms, and the other uses the sum of the importance weights of the commonly used terms in users' reviews. For the importance weights of terms, we propose the use of average TF-IDF (Term Frequency - Inverse Document Frequency) weights. To validate the applicability of the proposed system, we applied it to a recommender system for smartphone applications (hereafter, apps). At present, over a million apps are offered in each of the app stores operated by Google and Apple. Due to this information overload, users have difficulty selecting the apps that they really want. Furthermore, app store operators like Google and Apple have accumulated a huge amount of users' reviews of apps. Thus, we chose smartphone app stores as the application domain of our system. To collect the experimental data set, we built and operated a Web-based data collection system for about two weeks. As a result, we obtained 1,246 valid responses (ratings and reviews) from 78 users. The experimental system was implemented using Microsoft Visual Basic for Applications (VBA) and SAS Text Miner. To avoid distortion due to human intervention, we did not apply any manual refinement during the review mining process. To examine the effectiveness of the proposed system, we compared its performance with that of a conventional CF system. The performance of the recommender systems was evaluated using the average MAE (mean absolute error). The experimental results showed that our proposed system (MAE = 0.7867 ~ 0.7881) slightly outperformed a conventional CF system (MAE = 0.7939). They also showed that calculating review similarity between users based on the TF-IDF weights (MAE = 0.7867) led to better recommendation accuracy than calculating it based on the frequency of the commonly used terms in reviews (MAE = 0.7881). The results of a paired-samples t-test showed that our proposed system with review similarity calculated from the frequency of the commonly used terms outperformed the conventional CF system at the 10% significance level. Our study sheds light on the use of users' review information to facilitate electronic commerce by recommending suitable items to users.
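
The similarity and evaluation steps described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the two users' weights for a shared term are merged or how rating and review similarity are combined, so the per-term averaging and the linear blend with a hypothetical parameter `alpha` are assumptions, as are all function names.

```python
def review_similarity(weights_a, weights_b):
    """Sum importance weights over the terms that both users used.

    weights_a, weights_b: dicts mapping term -> importance weight
    (for example, an average TF-IDF weight). For each shared term the
    two users' weights are averaged, which is an assumption.
    """
    common_terms = set(weights_a) & set(weights_b)
    return sum((weights_a[t] + weights_b[t]) / 2.0 for t in common_terms)


def combined_similarity(rating_sim, review_sim, alpha=0.5):
    """Blend rating and review similarity linearly (alpha is hypothetical)."""
    return alpha * rating_sim + (1.0 - alpha) * review_sim


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical term weights for two users.
user_a = {"fast": 0.4, "fun": 0.2}
user_b = {"fast": 0.6, "slow": 0.9}
review_sim = review_similarity(user_a, user_b)
final_sim = combined_similarity(0.8, review_sim, alpha=0.5)
mae = mean_absolute_error([4.0, 3.0], [5.0, 3.0])
```

A lower MAE means predicted ratings are closer to the actual ones, which is how the systems in the abstract are compared.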

Investigations on Techniques and Applications of Text Analytics (텍스트 분석 기술 및 활용 동향)

  • Kim, Namgyu;Lee, Donghoon;Choi, Hochang;Wong, William Xiu Shun
    • The Journal of Korean Institute of Communications and Information Sciences, v.42 no.2, pp.471-492, 2017
  • The demand for and interest in big data analytics are increasing rapidly. The concept of big data covers not only existing structured data but also various kinds of unstructured data such as text, images, videos, and logs. Among the various types of unstructured data, text has gained particular attention because it is the most representative means of describing and delivering information. Text analysis is generally performed in the following order: document collection, parsing and filtering, structuring, frequency analysis, and similarity analysis. The results of the analysis can be displayed through word clouds, word networks, topic modeling, document classification, and semantic analysis. Notably, there is an increasing demand to identify trending topics from the rapidly growing volume of text data generated through social media. Thus, research on and applications of topic modeling have been actively carried out in various fields, since topic modeling can extract the core topics from a huge number of unstructured text documents and provide the document groups for each topic. In this paper, we review the major techniques and research trends of text analysis. We also introduce some application cases that solve problems in various fields by using topic modeling.

Oral health literacy among foreign residents in South Korea (국내 거주 외국인의 한국형 구강건강정보 이해능력)

  • Kim, Hyun-Kyung;Jeong, Ju-Hui;Noh, Hie-Jin
    • Journal of Korean society of Dental Hygiene, v.16 no.6, pp.879-891, 2016
  • Objectives: This study was conducted to evaluate the oral health literacy of foreign students in Korea regarding their utilization of dental clinic services and oral care products. Methods: Oral health literacy was measured through a self-administered questionnaire that was distributed to 145 foreign students in Seoul and 153 Korean students in Wonju, Gangwon province. The questionnaire comprised a total of 92 questions: 30 questions on linguistic oral health literacy, 40 questions on functional oral health literacy (27 questions on sentence translation ability and 13 questions on document decoding ability), and 22 questions on general characteristics. The collected data were analyzed by frequency analysis, the $\chi^2$ test, the independent t-test, and ANOVA, with a p-value of <0.05 considered statistically significant. Results: The linguistic oral health literacy awareness score was less than half as high in foreign students ($20.5 \pm 22.4\%$) as in Korean students ($53.9 \pm 18.4\%$) (p<0.05); for three words, recognized by fewer than 10% of all foreign and Korean students, the difference was not statistically significant. The correct answer rate for sentence translation ability differed significantly on all questions between foreign students ($26.7 \pm 27.1\%$) and Korean students ($99.0 \pm 2.3\%$) (p<0.05). The correct answer rate for document decoding ability showed a relatively small difference between foreign students ($54.7 \pm 33.1\%$) and Korean students ($87.3 \pm 8.7\%$), but the difference was statistically significant on all questions (p<0.05). Among the variables examined, the residence period and Korean language class level of foreign students were most strongly correlated with oral health literacy (p<0.05). Conclusions: Dental terminology was difficult for ordinary people to understand regardless of Korean language proficiency, so dental clinical terms should be expressed in simple layman's terms or explained with illustrations for dental patients. For foreign residents in Korea, interpretation services are needed. In addition, the labels and instructions of oral hygiene products sold in Korea should take foreigners into consideration.

A Study on the Development of Korean Defense Standards through Text Mining-Based Trend Analysis of United States Defense Standards (텍스트 마이닝 기반의 미국 국방 표준 동향 분석을 통한 한국 국방 표준의 발전 방안 연구)

  • Chae, Soohwan;Shim, Bohyun;Yeom, Seulki;Hong, Seongdon
    • Journal of the Korea Academia-Industrial cooperation Society, v.22 no.3, pp.651-660, 2021
  • This study examined trends in the standards established in the United States in order to find points that can be applied to Korean defense standards. The titles of various United States defense standard documents registered on the web were selected for this research. A word cloud was created after analyzing the frequency of words appearing in the titles using text mining. The trend of words appearing in MIL-STD by era was obtained. This study identified words that appear often because of the format of the document itself, words that appear regularly throughout the eras, words that were used frequently in the past but are not used much at present, and words that did not receive attention in the past but appear recurrently at present. In addition, the characteristics of each document were derived from the word clouds produced for various defense documents. In conclusion, Korean defense standards also require consideration of the safe and efficient management, transport, and load design of hazardous materials. Furthermore, the quality of defense standards can be expected to improve if a defense standard document system focused on efficient management is established.

Finding Frequent Route of Taxi Trip Events Based on MapReduce and MongoDB (택시 데이터에 대한 효율적인 Top-K 빈도 검색)

  • Putri, Fadhilah Kurnia;An, Seonga;Purnaningtyas, Magdalena Trie;Jeong, Han-You;Kwon, Joonho
    • KIPS Transactions on Software and Data Engineering, v.4 no.9, pp.347-356, 2015
  • Due to the rapid development of IoT (Internet of Things) technology, traditional taxis are now connected through dispatchers and location systems. Typically, modern taxis are equipped with GPS (Global Positioning System) to obtain route information. By analyzing the frequency of taxi trip events, we can find the frequent routes for a given query time. However, a scalability problem occurs when converting the raw location data of taxi trip events into analyzed frequency information, because of the volume of the location data. To address this problem, we propose a NoSQL-based top-K query system for taxi trip events. First, we analyze raw taxi trip events and extract the frequencies of all routes. Then, we store the frequency information in a hash-based index structure of MongoDB, which is a document-oriented NoSQL database. Efficient top-K query processing for frequent routes is performed on top of MongoDB. We validate the efficiency of our algorithms using real taxi trip events from New York City.
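
The idea of extracting route frequencies and answering a top-K query can be sketched in memory as follows. This is only a minimal single-machine illustration: it uses neither MapReduce nor MongoDB, and the function name `top_k_routes`, the `(pickup, dropoff, hour)` tuple layout, and the sample zones are assumptions rather than the paper's data model.

```python
from collections import Counter


def top_k_routes(trips, k, hour=None):
    """Return the k most frequent (pickup, dropoff) routes.

    trips: iterable of (pickup, dropoff, hour) tuples (hypothetical layout).
    hour: if given, only trips recorded at that hour are counted.
    """
    route_counts = Counter(
        (pickup, dropoff)
        for pickup, dropoff, trip_hour in trips
        if hour is None or trip_hour == hour
    )
    # most_common sorts routes by descending frequency.
    return route_counts.most_common(k)


# Hypothetical trip events between zones A-D.
trips = [
    ("A", "B", 8), ("A", "B", 8), ("A", "C", 8),
    ("A", "B", 9), ("C", "D", 9), ("C", "D", 9),
]
overall = top_k_routes(trips, 1)
at_nine = top_k_routes(trips, 1, hour=9)
```

In a system like the one described, the counting step would be distributed and the resulting frequencies stored in an index so that a query does not rescan the raw events.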