• Title/Summary/Keyword: Term Frequency-Inverse Document Frequency

Search Result 91, Processing Time 0.027 seconds

A Study on Keyword Extraction From a Single Document Using Term Clustering (용어 클러스터링을 이용한 단일문서 키워드 추출에 관한 연구)

  • Han, Seung-Hee
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.44 no.3
    • /
    • pp.155-173
    • /
    • 2010
  • In this study, a new keyword extraction algorithm is applied to a single document with term clustering. A single document is divided by multiple passages, and two ways of calculating similarities between two terms are investigated; the first-order similarity and the second-order distributional similarity. In this experiment, the best cluster performance is achieved with a 50-term passage from the second-order distributional similarity. From the results of first experiment, the second-order distribution similarity was also applied to various keyword extraction methods using statistic information of terms. In the second experiment, pf(paragraph frequency) and $tf{\times}ipf$(term frequency by inverse paragraph frequency) were found to improve the overall performance of keyword extraction. Therefore, it showed that the algorithm fulfills the necessary conditions which good keywords should have.

Evaluation Model for Gab Analysis Between NCS Competence Unit Element and Traditional Curriculum (NCS 능력단위 요소와 기존 교육과정 간 갭 분석을 위한 평가모델)

  • Kim, Dae-kyung;Kim, Chang-Bok
    • Journal of Advanced Navigation Technology
    • /
    • v.19 no.4
    • /
    • pp.338-344
    • /
    • 2015
  • The national competency standards (NCS) is a systematize and standardize for skills required to perform their job. The NCS has developed a learning module with materialization and standardize by competence unit element, which is the unit of specific job competency. The existing curriculum is material to gab analysis for use in education training with competence unit element. The existing gab analysis has evaluated subjectively by experts. The gab analysis by experts bring up a subject subjective decision, accuracy lack, temporal and spatial inefficiency by psychological factor. This paper is proposed automated evaluation model for problem resolve of subjective evaluation. This paper use index term extraction, term frequency-inverse document frequency for feature value extraction, cosine similarity algorithm for gab analysis between existing curriculum and competence unit element. This paper was presented similarity mapping table between existing curriculum and competence unit element. The evaluation model in this paper should be complemented by an improved algorithm from the structural characteristics and speed.

An Ensemble Approach for Cyber Bullying Text messages and Images

  • Zarapala Sunitha Bai;Sreelatha Malempati
    • International Journal of Computer Science & Network Security
    • /
    • v.23 no.11
    • /
    • pp.59-66
    • /
    • 2023
  • Text mining (TM) is most widely used to find patterns from various text documents. Cyber-bullying is the term that is used to abuse a person online or offline platform. Nowadays cyber-bullying becomes more dangerous to people who are using social networking sites (SNS). Cyber-bullying is of many types such as text messaging, morphed images, morphed videos, etc. It is a very difficult task to prevent this type of abuse of the person in online SNS. Finding accurate text mining patterns gives better results in detecting cyber-bullying on any platform. Cyber-bullying is developed with the online SNS to send defamatory statements or orally bully other persons or by using the online platform to abuse in front of SNS users. Deep Learning (DL) is one of the significant domains which are used to extract and learn the quality features dynamically from the low-level text inclusions. In this scenario, Convolutional neural networks (CNN) are used for training the text data, images, and videos. CNN is a very powerful approach to training on these types of data and achieved better text classification. In this paper, an Ensemble model is introduced with the integration of Term Frequency (TF)-Inverse document frequency (IDF) and Deep Neural Network (DNN) with advanced feature-extracting techniques to classify the bullying text, images, and videos. The proposed approach also focused on reducing the training time and memory usage which helps the classification improvement.

An Optimal Weighting Method in Supervised Learning of Linguistic Model for Text Classification

  • Mikawa, Kenta;Ishida, Takashi;Goto, Masayuki
    • Industrial Engineering and Management Systems
    • /
    • v.11 no.1
    • /
    • pp.87-93
    • /
    • 2012
  • This paper discusses a new weighting method for text analyzing from the view point of supervised learning. The term frequency and inverse term frequency measure (tf-idf measure) is famous weighting method for information retrieval, and this method can be used for text analyzing either. However, it is an experimental weighting method for information retrieval whose effectiveness is not clarified from the theoretical viewpoints. Therefore, other effective weighting measure may be obtained for document classification problems. In this study, we propose the optimal weighting method for document classification problems from the view point of supervised learning. The proposed measure is more suitable for the text classification problem as used training data than the tf-idf measure. The effectiveness of our proposal is clarified by simulation experiments for the text classification problems of newspaper article and the customer review which is posted on the web site.

Design of Document Suggestion System based on TF-IDF Algorithm for Efficient Organization of Documentation (효율적인 문서 구성을 위한 TF-IDF 알고리즘 기반 문서 제안 시스템의 설계)

  • Kim, Young-Hoon;Park, Seung-Min;Cho, Dae-Soo
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2022.07a
    • /
    • pp.527-528
    • /
    • 2022
  • 빠르게 변하는 환경에 맞춰 평생 교육이 일반화되고 개인에게 요구되는 학습량은 많아지고 있으며 높아진 학습량에 맞게 학습 시간 단축과 효율적인 학습을 위한 학습 방법을 선택하는 것이 중요해지고 있다. 본 논문에서는 학습 정리를 위해 작성한 문서를 분석하여 해당 문서와 관련된 문서를 제안하고 본 문서와 엮어 학습을 위한 문서 묶음을 만들 수 있는 시스템을 제안한다. 문서의 유사도, 중요도를 구할 수 있는 TF-IDF를 이용하여 문서를 분석해 키워드를 추출한 다음 그와 관련된 문서를 제안하고 문서 묶음을 만들어 조회할 수 있도록 한다. 이 시스템은 학습 정리 시 관련 문서를 함께 볼 수 있도록 하고, 필요하다면 묶음으로 만들어 효과적인 학습을 위한 도구로 이용할 수 있다.

  • PDF

Feature-selection algorithm based on genetic algorithms using unstructured data for attack mail identification (공격 메일 식별을 위한 비정형 데이터를 사용한 유전자 알고리즘 기반의 특징선택 알고리즘)

  • Hong, Sung-Sam;Kim, Dong-Wook;Han, Myung-Mook
    • Journal of Internet Computing and Services
    • /
    • v.20 no.1
    • /
    • pp.1-10
    • /
    • 2019
  • Since big-data text mining extracts many features and data, clustering and classification can result in high computational complexity and low reliability of the analysis results. In particular, a term document matrix obtained through text mining represents term-document features, but produces a sparse matrix. We designed an advanced genetic algorithm (GA) to extract features in text mining for detection model. Term frequency inverse document frequency (TF-IDF) is used to reflect the document-term relationships in feature extraction. Through a repetitive process, a predetermined number of features are selected. And, we used the sparsity score to improve the performance of detection model. If a spam mail data set has the high sparsity, detection model have low performance and is difficult to search the optimization detection model. In addition, we find a low sparsity model that have also high TF-IDF score by using s(F) where the numerator in fitness function. We also verified its performance by applying the proposed algorithm to text classification. As a result, we have found that our algorithm shows higher performance (speed and accuracy) in attack mail classification.

A Study on Analysis of Topic Modeling using Customer Reviews based on Sharing Economy: Focusing on Sharing Parking (공유경제 기반의 고객리뷰를 이용한 토픽모델링 분석: 공유주차를 중심으로)

  • Lee, Taewon
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.25 no.3
    • /
    • pp.39-51
    • /
    • 2020
  • This study will examine the social issues and consumer awareness of sharing parking through the method text mining. In this experiment, the topic by keyword was extracted and analyzed using TFIDF (Term frequency inverse document frequency) and LDA (Latent dirichlet allocation) technique. As a result of categorization by topic, citizens' complaints such as local government agreements, parking space negotiations, parking culture improvement, citizen participation, etc., played an important role in implementing shared parking services. The contribution of this study highly differentiated from previous studies that conducted exploratory studies using corporate and regional cases, and can be said to have a high academic contribution. In addition, based on the results obtained by utilizing the LDA analysis in this study, there is a practical contribution that it can be applied or utilized in establishing a sharing economy policy for revitalizing the local economy.

Comparison of Product and Customer Feature Selection Methods for Content-based Recommendation in Internet Storefronts (인터넷 상점에서의 내용기반 추천을 위한 상품 및 고객의 자질 추출 성능 비교)

  • Ahn Hyung-Jun;Kim Jong-Woo
    • The KIPS Transactions:PartD
    • /
    • v.13D no.2 s.105
    • /
    • pp.279-286
    • /
    • 2006
  • One of the widely used methods for product recommendation in Internet storefronts is matching product features against target customer profiles. When using this method, it's very important to choose a suitable subset of features for recommendation efficiency and performance, which, however, has not been rigorously researched so far. In this paper, we utilize a dataset collected from a virtual shopping experiment in a Korean Internet book shopping mall to compare several popular methods from other disciplines for selecting features for product recommendation: the vector-space model, TFIDF(Term Frequency-Inverse Document Frequency), the mutual information method, and the singular value decomposition(SVD). The application of SVD showed the best performance in the analysis results.

A Study on Change in Perception of Community Service and Demand Prediction based on Big Data

  • Chun-Ok, Jang
    • International Journal of Advanced Culture Technology
    • /
    • v.10 no.4
    • /
    • pp.230-237
    • /
    • 2022
  • The Community Social Service Investment project started as a state subsidy project in 2007 and has grown very rapidly in quantitative terms in a short period of time. It is a bottom-up project that discovers the welfare needs of people and plans and provides services suitable for them. The purpose of this study is to analyze using big data to determine the social response to local community service investment projects. For this, data was collected and analyzed by crawling with a specific keyword of community service investment project on Google and Naver sites. As for the analysis contents, monthly search volume, related keywords, monthly search volume, search rate by age, and gender search rate were conducted. As a result, 10 items were found as related keywords in Google, and 3 items were found in Naver. The overall results of Google and Naver sites were slightly different, but they increased and decreased at almost the same time. Therefore, it can be seen that the community service investment project continues to attract users' interest.

Design of WWW IR System Based on Keyword Clustering Architecture (색인어 말뭉치 처리를 기반으로 한 웹 정보검색 시스템의 설계)

  • 송점동;이정현;최준혁
    • The Journal of Information Technology
    • /
    • v.1 no.1
    • /
    • pp.13-26
    • /
    • 1998
  • In general Information retrieval systems, improper keywords are often extracted and different search results are offered comparing to user's aim bacause the systems use only term frequency informations for selecting keywords and don't consider their meanings. It represents that improving precision is limited without considering semantics of keywords because recall ratio and precision have inverse proportion relation. In this paper, a system which is able to improve precision without decreasing recall ratio is designed and implemented, as client user module is introduced which can send feedbacks to server with user's intention. For this purpose, keywords are selected using relative term frequency and inverse document frequency and co-occurrence words are extracted from original documents. Then, the keywords are clustered by their semantics using calculated mutual informations. In this paper, the system can reject inappropriate documents using segmented semantic informations according to feedbacks from client user module. Consequently precision of the system is improved without decreasing recall ratio.

  • PDF