• Title/Summary/Keyword: tf-idf

Classifying Sub-Categories of Apartment Defect Repair Tasks: A Machine Learning Approach (아파트 하자 보수 시설공사 세부공종 머신러닝 분류 시스템에 관한 연구)

  • Kim, Eunhye;Ji, HongGeun;Kim, Jina;Park, Eunil;Ohm, Jay Y.
    • KIPS Transactions on Software and Data Engineering / v.10 no.9 / pp.359-366 / 2021
  • Many construction companies in Korea invest considerable human and financial resources in systems for managing apartment defect data and categorizing repair tasks. This study therefore proposes machine learning models that automatically classify defect complaint text into one of the subcategories of 'finishing work' (one of the defect repair tasks). The proposed models combine two word representation methods (Bag-of-Words and Term Frequency-Inverse Document Frequency (TF-IDF)) with two machine learning classifiers (Support Vector Machine and Random Forest). We conducted both binary and multi-class classification tasks over nine subcategories of finishing work: home appliance installation work, paperwork, painting work, plastering work, interior masonry work, plaster finishing work, indoor furniture installation work, kitchen facility installation work, and tiling work. The models using the TF-IDF representation with Random Forest classification achieved more than 90% accuracy, precision, recall, and F1 score. The results point to the possibility of building automated defect classification systems based on the proposed models.
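
The two word representations named in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the complaint texts are invented, tokenization is a plain whitespace split, and the TF-IDF variant (relative term frequency times log(N/df)) is one common textbook form chosen here as an assumption. The classifier stage (SVM or Random Forest) is not shown.

```python
import math
from collections import Counter

# Hypothetical defect complaints (illustrative only, not the study's data).
docs = [
    "tile cracked in bathroom floor",
    "paint peeling on bedroom wall",
    "tile loose near kitchen wall",
]

tokenized = [d.split() for d in docs]
vocab = sorted({w for tokens in tokenized for w in tokens})

def bag_of_words(tokens):
    """Raw count of each vocabulary word in one document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

# Document frequency and inverse document frequency per vocabulary word.
n_docs = len(tokenized)
df = {w: sum(w in tokens for tokens in tokenized) for w in vocab}
idf = {w: math.log(n_docs / df[w]) for w in vocab}

def tf_idf(tokens):
    """Relative term frequency weighted by IDF (assumed variant)."""
    counts = Counter(tokens)
    return [counts[w] / len(tokens) * idf[w] for w in vocab]

bow_vectors = [bag_of_words(t) for t in tokenized]
tfidf_vectors = [tf_idf(t) for t in tokenized]
```

In the first complaint, "tile" and "cracked" have the same Bag-of-Words count, but "cracked" receives the larger TF-IDF weight because it occurs in only one document.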

A Study on the Changes in Perspectives on Unwed Mothers in S.Korea and the Direction of Government Polices: 1995~2020 Social Media Big Data Analysis (한국미혼모에 대한 관점 변화와 정부정책의 방향: 1995년~2020년 소셜미디어 빅데이터 분석)

  • Seo, Donghee;Jun, Boksun
    • Journal of the Korea Convergence Society / v.12 no.12 / pp.305-313 / 2021
  • This study collected and analyzed big data from 1995 to 2020, focusing on the keywords "unwed mother," "single mother," and "single mom," to present appropriate directions for government support policy in light of changing perspectives on unwed mothers. The big data collection platform Textom was used to collect data from the portal search sites Naver and Daum and to refine them. The refined data were then subjected to word frequency analysis, TF-IDF analysis, and N-gram analysis, all provided by Textom. In addition, network analysis and CONCOR analysis were conducted with the UCINET6 program. Similar words appeared in the word frequency and TF-IDF analyses, but they differed by year. In the N-gram analysis, the words that appeared were similar, but the frequency and form of word sequences differed considerably. The CONCOR analysis showed that different clusters formed by year. This study confirms the change in perspectives on unwed mothers through big data analysis and suggests the need for unwed-mother policies that offer various options for independent women, and for policies that embrace pregnancy, childbirth, and parenting without discrimination within new family forms.
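
The N-gram analysis mentioned in this abstract counts sequences of adjacent words. A minimal sketch of bigram counting follows. It does not reproduce Textom's implementation, which the abstract does not describe, and the posts are invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical, already-refined posts (illustrative only).
posts = [
    "single mother support policy",
    "single mother child care",
    "child care support policy",
]

bigram_counts = Counter()
for post in posts:
    bigram_counts.update(ngrams(post.split(), 2))

# Most frequent word pairs across all posts.
top_bigrams = bigram_counts.most_common(3)
```

Running the same count separately on each year's posts would give the kind of year-by-year comparison of word sequences that the abstract reports.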

Analysis of Major COVID-19 Issues Using Unstructured Big Data (비정형 빅데이터를 이용한 COVID-19 주요 이슈 분석)

  • Kim, Jinsol;Shin, Donghoon;Kim, Heewoong
    • Knowledge Management Research / v.22 no.2 / pp.145-165 / 2021
  • The COVID-19 pandemic began spreading in late December 2019 and put the entire world in panic. To overcome the crisis and minimize subsequent damage, the government and its affiliated institutions must maximize the effects of existing policy support and introduce a holistic response plan that reflects the changing situation, which is why it is crucial to analyze social topics and people's interests. This study investigates people's major thoughts, attitudes, and topics surrounding the COVID-19 pandemic using social media and big data. To collect public opinion, the study segmented the time period according to government countermeasures. All data were collected from NAVER blogs from 31 December 2019 to 12 December 2020. The research applied TF-IDF keyword extraction and LDA topic modeling as text-mining techniques. As a result, eight major issues related to COVID-19 were derived, and policy strategies were presented based on these keywords. The significance of this study is that it provides baseline data for Korean government authorities in devising appropriate countermeasures that can satisfy people's needs in the midst of the COVID-19 pandemic.

Analysis of Research Trends in Tax Compliance using Topic Modeling (토픽모델링을 활용한 조세순응 연구 동향 분석)

  • Kang, Min-Jo;Baek, Pyoung-Gu
    • The Journal of the Korea Contents Association / v.22 no.1 / pp.99-115 / 2022
  • In this study, domestic academic journal papers on tax compliance, tax consciousness, and faithful tax payment (hereinafter "tax compliance"), a representative research topic in the field of tax science, were comprehensively analyzed from an interdisciplinary perspective. To this end, topic modeling was applied as a text mining technique. Following a flow of data collection, keyword preprocessing, and topic model analysis, potential research topics were derived from the tax compliance related keywords registered by the researchers in a total of 347 papers. The results can be summarized as follows. First, in the keyword analysis, keywords such as tax investigation, tax avoidance, and honest tax reporting system were among the top 5 keywords by simple term frequency, and they were also among the top 5 by TF-IDF value, which considers the relative importance of keywords. By contrast, the keyword tax evasion ranked among the top keywords by TF-IDF value but was not prominent by simple term frequency. Second, eight potential research topics were derived through topic modeling: (1) tax fairness and suppression of tax offenses, (2) the ideology of the tax law and the validity of tax policies, (3) the principle of substance over form and guarantee of tax receivables, (4) tax compliance costs and tax administration services, (5) the tax return self-assessment system and tax experts, (6) tax climate and strategic tax behavior, (7) multifaceted tax behavior and differential compliance intentions, and (8) tax information system and tax resource management. The research examined the various perspectives on tax compliance from an interdisciplinary perspective, thereby comprehensively grasping past research trends and suggesting directions for future research.
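
The contrast this abstract reports, where a keyword ranks high by TF-IDF but not by simple term frequency, can be reproduced on toy data. The sketch below is an assumption-laden illustration: the keyword lists are invented, and the corpus-level score (total count times log(N/df)) is one plausible weighting, not necessarily the one the authors used.

```python
import math
from collections import Counter

# Hypothetical keyword lists, one per paper (illustrative only).
papers = [
    ["audit", "audit", "compliance"],
    ["audit", "compliance"],
    ["audit", "compliance"],
    ["evasion", "evasion", "evasion", "compliance"],
]

n_papers = len(papers)
# Total occurrences of each keyword across the corpus.
tf = Counter(k for paper in papers for k in paper)
# Number of papers in which each keyword appears at least once.
df = Counter(k for paper in papers for k in set(paper))
# Assumed corpus-level TF-IDF: total count times log(N / df).
tfidf = {k: tf[k] * math.log(n_papers / df[k]) for k in tf}

# Rank by descending score, breaking ties alphabetically.
by_tf = sorted(tf, key=lambda k: (-tf[k], k))
by_tfidf = sorted(tfidf, key=lambda k: (-tfidf[k], k))
```

Here "compliance" appears in every paper, so its IDF is zero and it drops to the bottom of the TF-IDF ranking, while "evasion" is concentrated in a single paper and rises to the top.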

Open Domain Machine Reading Comprehension using InferSent (InferSent를 활용한 오픈 도메인 기계독해)

  • Jeong-Hoon, Kim;Jun-Yeong, Kim;Jun, Park;Sung-Wook, Park;Se-Hoon, Jung;Chun-Bo, Sim
    • Smart Media Journal / v.11 no.10 / pp.89-96 / 2022
  • An open domain machine reading comprehension model adds a paragraph search function, because no paragraph related to a given question is supplied in advance. Document search, despite abundant research with word-frequency-based TF-IDF, suffers from lower performance when there are many documents. Paragraph selection, despite much research with word-based embeddings, fails to accurately extract paragraph context, including sentence characteristics. Document reading comprehension, despite much research on BERT, suffers from slow learning due to the growing number of parameters. To address these three issues, this study used BM25, which also takes sentence length into account, and InferSent to capture sentence context, and proposed an open domain machine reading comprehension model with ALBERT to reduce the number of parameters. An experiment was conducted on the SQuAD1.1 dataset. BM25 outperformed TF-IDF in document search by 3.2%. InferSent outperformed Transformer in paragraph selection by 0.9%. Finally, as the number of paragraphs increased in document comprehension, ALBERT was 0.4% higher in EM and 0.2% higher in F1.
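
The length normalization that distinguishes BM25 from plain TF-IDF can be seen in a short sketch. This uses the widely cited Okapi BM25 formula with conventional default parameters (k1 = 1.5, b = 0.75); the paper's actual settings are not given in the abstract, and the documents and the function name `bm25_scores` are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n  # average document length
    df = Counter(w for t in tokenized for w in set(t))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.split():
            if term not in counts:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Longer-than-average documents are penalized through b.
            norm = counts[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * counts[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical documents (illustrative only).
docs = [
    "albert reduces parameters",
    "albert is one of many models discussed in this long survey of language models",
    "bm25 ranks documents",
]
scores = bm25_scores("albert", docs)
```

Both of the first two documents mention the query term once, but the short one scores higher because BM25 discounts term matches in longer documents.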

Comparative Study of User Reactions in OTT Service Platforms Using Text Mining (텍스트 마이닝을 활용한 OTT 서비스 플랫폼별 사용자 반응 비교 연구)

  • Soonchan Kwon;Jieun Kim;Beakcheol Jang
    • Journal of Internet Computing and Services / v.25 no.3 / pp.43-54 / 2024
  • This study employs text mining techniques to compare user responses across various Over-The-Top (OTT) service platforms. The primary objective is to understand user satisfaction with OTT service platforms and contribute to the formulation of more effective review strategies. The key questions involve identifying prominent topics and keywords in user reviews of different OTT services and understanding platform-specific user reactions. TF-IDF is used to extract significant words from positive and negative reviews, while BERTopic, an advanced topic modeling technique, is employed for a more nuanced and comprehensive analysis of intricate user reviews. The TF-IDF analysis reveals that positive app reviews exhibit a high frequency of content-related words, whereas negative reviews display a high frequency of words associated with potential issues during app usage. Using BERTopic, we extracted keywords related to content diversity, app performance components, payment, and compatibility by associating them with content attributes. This enabled us to verify that the distinguishing attributes vary across platforms. The findings offer significant insights into user behavior and preferences, which OTT service providers can leverage to improve user experience and satisfaction. We also anticipate that researchers exploring deep learning models will find our results valuable for conducting analyses on user review text data.

A Study on Negation Handling and Term Weighting Schemes and Their Effects on Mood-based Text Classification (감정 기반 블로그 문서 분류를 위한 부정어 처리 및 단어 가중치 적용 기법의 효과에 대한 연구)

  • Jung, Yu-Chul;Choi, Yoon-Jung;Myaeng, Sung-Hyon
    • Korean Journal of Cognitive Science / v.19 no.4 / pp.477-497 / 2008
  • Mood classification of blog text is an interesting problem with potential for a variety of Web services. This paper introduces an approach to enhancing mood classification through normalized negation n-grams, which contain mood clues, and corpus-specific term weighting (CSTW). We ran experiments on blog texts with two classification methods: Enhanced Mood Flow Analysis (EMFA) and Support Vector Machine based Mood Classification (SVMMC). The results show that the normalized negation n-gram method is quite effective in dealing with negations and gave gradual improvements in mood classification with EMFA. From the selection of CSTW, we found that an appropriate weighting scheme is important for supporting adequate levels of mood classification performance, as CSTW outperforms both TF*IDF and TF.

The Study on the patient safety culture convergence research topics through text mining and CONCOR analysis (텍스트마이닝 및 CONCOR 분석을 활용한 환자안전문화 융복합 연구주제 분석)

  • Baek, Su Mi;Moon, Inn Oh
    • Journal of Digital Convergence / v.19 no.12 / pp.359-367 / 2021
  • The purpose of this study is to analyze domestic research topics on patient safety culture using text mining and CONCOR analysis. The research proceeded in stages: data collection, data preprocessing, text mining and social network analysis, and CONCOR analysis. A total of 136 articles were analyzed, excluding papers that were not published. Data analysis was performed using the Textom and UCINET programs. In the patient safety culture-related studies, 'patient safety' had the highest TF (frequency), and 'nursing' had the highest TF-IDF (importance in documents). The CONCOR analysis derived a total of seven clusters that constitute patient safety culture: knowledge and attitude, communication, medical service, team, work environment, structure, and organization and management. Future research should examine the relationship between the establishment of a patient safety culture and patient outcomes.

A New Scheme Exploiting the Related Keyword and Big Data Analysis for Predicting Promise Technology in the Field of Satellite·Terrestrial Information Convergence Disaster Response (위성·지상정보 융합 재난 대응 기술 분야 유망기술 도출을 위한 연관 키워드 및 빅데이터 분석 기법)

  • Lee, Hangwon;Kim, Youngok
    • Journal of the Society of Disaster Information / v.18 no.2 / pp.418-431 / 2022
  • Purpose: We propose a new scheme for predicting promising technology that improves on the conventional scheme, which misses important patents because of an insufficient search formula and cannot reflect new technology trends because of the period during which patents remain unpublished. Method: We propose a new search formula that exploits TF and TF-IDF, computed with R, together with related keywords, and use an LDA topic modeling scheme to analyze recently published papers in Satellite·Terrestrial Information Convergence Disaster Response. Result: In a comparison of both schemes using a commercial DB, the proposed scheme can find more important patents and can reflect new technology trends better than the conventional scheme. Conclusion: The proposed scheme can be used to predict promising technologies in the field of Satellite·Terrestrial Information Convergence Disaster Response.

Big Data Analysis on the Perception of Home Training According to the Implementation of COVID-19 Social Distancing

  • Hyun-Chang Keum;Kyung-Won Byun
    • International Journal of Internet, Broadcasting and Communication / v.15 no.3 / pp.211-218 / 2023
  • With the implementation of COVID-19 distancing, interest in and users of 'home training' are rapidly increasing. The purpose of this study is therefore to identify perceptions of 'home training' through big data analysis of social media channels and to provide basic data to the related business sector. Big data were collected from news and social content on Naver and Google. Data covering three years from March 22, 2020, when COVID-19 distancing was implemented in Korea, were collected. The collected data included 4,000 Naver blogs, 2,673 news, 4,000 cafes, 3,989 knowledge IN, and 953 Google channel news. TF and TF-IDF were analyzed for these data through text mining, and semantic network analysis was then conducted on 70 keywords. Big data analysis programs such as Textom and Ucinet were used for the social big data analysis, and NetDraw was used for visualization. In the text mining analysis, 'home training' was the most frequent term by TF, appearing 4,045 times, followed by 'exercise', 'Homt', 'house', 'apparatus', 'recommendation', and 'diet'. By TF-IDF, the main keywords were 'exercise', 'apparatus', 'home', 'house', 'diet', 'recommendation', and 'mat'. Based on these results, 70 high-frequency keywords were extracted, and semantic indicators and centrality analysis were then conducted. Finally, CONCOR analysis yielded four clusters: 'purchase cluster', 'equipment cluster', 'diet cluster', and 'execute method cluster'. Based on these four clusters, basic data for the 'home training' business sector were presented, drawing on consumers' main perceptions of 'home training' and the analysis of the semantic network.
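
The semantic network and centrality steps in this abstract start from keyword co-occurrence. The sketch below builds a co-occurrence count and a simple degree centrality on invented keyword lists. It is an illustration only: it does not reproduce the Textom, Ucinet, or NetDraw procedures, and degree centrality is just one of several centrality measures the study may have used.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword lists, one per post (illustrative only).
posts = [
    ["exercise", "mat", "diet"],
    ["exercise", "apparatus"],
    ["exercise", "diet"],
]

# Count how often each unordered keyword pair shares a post.
cooccurrence = Counter()
for keywords in posts:
    for a, b in combinations(sorted(set(keywords)), 2):
        cooccurrence[(a, b)] += 1

# Build an undirected neighbor map from the co-occurring pairs.
neighbors = {}
for a, b in cooccurrence:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

# Degree centrality: share of the other nodes a keyword is linked to.
n_nodes = len(neighbors)
degree_centrality = {k: len(v) / (n_nodes - 1) for k, v in neighbors.items()}
```

In this toy network "exercise" co-occurs with every other keyword and so has the maximum degree centrality of 1.0.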