• Title/Summary/Keyword: Rocchio algorithm

A Study on the Automatic Descriptor Assignment for Scientific Journal Articles Using Rocchio Algorithm (로치오 알고리즘을 이용한 학술지 논문의 디스크 립터 자동부여에 관한 연구)

  • Kim, Pan-Jun
    • Journal of the Korean Society for Information Management / v.23 no.3 s.61 / pp.69-89 / 2006
  • This study examined several performance factors in automatic indexing with a controlled vocabulary and in text categorization based on the Rocchio algorithm, and tried simple methods for improving their performance. The Rocchio-based methods were also compared with other learning-based methods under the same conditions. While retaining their strengths of easy implementation and computational efficiency, the Rocchio-based methods showed results equivalent to or better than those of the other learning-based methods (SVM, VPT, NB). In particular, for semi-automatic indexing (computer-aided indexing), Rocchio-based methods with a high recall level could be used preferentially.
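
To make the abstract above more concrete, here is a minimal sketch of a Rocchio-style centroid classifier that returns a ranked list of candidate descriptors, as one might want for computer-aided indexing. It is not the paper's implementation: the function names, the `beta`/`gamma` defaults, the plain term-frequency weighting, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tf_vector(tokens):
    """L2-normalised term-frequency vector stored as a dict."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}


def train_rocchio(docs, labels, beta=1.0, gamma=0.25):
    """One prototype per class: beta * mean(positives) - gamma * mean(negatives)."""
    vectors = [tf_vector(d) for d in docs]
    prototypes = {}
    for cls in set(labels):
        pos = [v for v, y in zip(vectors, labels) if y == cls]
        neg = [v for v, y in zip(vectors, labels) if y != cls]
        proto = defaultdict(float)
        for v in pos:
            for t, w in v.items():
                proto[t] += beta * w / len(pos)
        for v in neg:
            for t, w in v.items():
                proto[t] -= gamma * w / len(neg)
        prototypes[cls] = dict(proto)
    return prototypes


def cosine(a, b):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_classes(prototypes, tokens):
    """Return (score, class) pairs, best first, so an indexer can review the top candidates."""
    v = tf_vector(tokens)
    return sorted(((cosine(v, p), c) for c, p in prototypes.items()), reverse=True)


# Hypothetical toy data, not taken from the paper.
docs = [
    ["svm", "kernel", "margin"],
    ["kernel", "margin", "classifier"],
    ["index", "descriptor", "thesaurus"],
    ["descriptor", "thesaurus", "vocabulary"],
]
labels = ["ml", "ml", "indexing", "indexing"]
protos = train_rocchio(docs, labels)
ranked = rank_classes(protos, ["thesaurus", "descriptor", "term"])
```

Returning the full ranking instead of a single label lets a human indexer accept or reject several suggested descriptors, which is one way to trade precision for the high recall the abstract mentions.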

Ranking by Inductive Inference in Collaborative Filtering Systems (협력적 여과 시스템에서 귀납 추리를 이용한 순위 결정)

  • Ko, Su-Jeong
    • Journal of KIISE: Software and Applications / v.37 no.9 / pp.659-668 / 2010
  • To recommend interesting items to a new user, collaborative filtering systems must capture the user's behavior and obtain new information about that user. To acquire this information, such systems learn user behavior from previous data and derive new information from the results. In this paper, we propose an inductive inference method that obtains new information about users and uses it to rank items. The proposed method clusters users into groups using NMF, an inductive machine learning method, and selects features for each group using the chi-square statistic. It then classifies a new user into a group using a Bayesian probability model, an inductive inference method, based on the new user's rating values and the group features. Finally, it decides the ranks of items by applying the Rocchio algorithm to items with missing values.

Performance Evaluation of the Extraction Method of Representative Keywords by Fuzzy Inference (퍼지추론 기반 대표 키워드 추출방법의 성능 평가)

  • Rho Sun-Ok;Kim Byeong Man;Oh Sang Yeop;Lee Hyun Ah
    • Journal of Korea Society of Industrial Information Systems / v.10 no.1 / pp.28-37 / 2005
  • In our previous work, we suggested a method that extracts representative keywords from a few positive documents and assigns weights to them. To show the usefulness of the method, in this paper we evaluate the performance of a well-known classification algorithm, GIS (Generalized Instance Set), when it is combined with our method. In the GIS algorithm, generalized instances are built from training documents by a generalization function, and then the K-NN algorithm is applied to them. Here, our method is used as the generalization function. For comparison, the Rocchio and Widrow-Hoff algorithms are also used as generalization functions. Experimental results show that our method is better than the others when only positive documents are considered, but not when negative documents are also considered.

A Study on the Performance Improvement of Rocchio Classifier with Term Weighting Methods (용어 가중치부여 기법을 이용한 로치오 분류기의 성능 향상에 관한 연구)

  • Kim, Pan-Jun
    • Journal of the Korean Society for Information Management / v.25 no.1 / pp.211-233 / 2008
  • This study examines various weighting methods for improving the performance of automatic classification based on the Rocchio algorithm on two collections (LISA, Reuters-21578). First, three weighting factors were identified (document factor, document set factor, and category factor), and the performance of the single weighting schemes based on each factor was investigated. Second, the performance of weighting methods that combine the single schemes was examined. Among the single schemes, category-factor-based schemes performed best, document-set-factor-based schemes second, and document-factor-based schemes worst. Among the combined schemes, those combining the document set factor with the category factor (idf*cat) performed better than those combining the document factor with the category factor (tf*cat or ltf*cat), and also better than the common schemes combining the document factor with the document set factor (tfidf or ltfidf). However, when single and combined schemes are compared by collection, category-factor-based schemes (cat only) performed best on LISA, whereas the combined schemes (idf*cat) performed best on Reuters-21578. Therefore, practical application of these weighting methods requires careful consideration of the categories in the collection to be classified.
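
As a rough illustration of the document factor and document set factor named above, the sketch below computes the common tfidf and ltfidf weights. The abstract does not define its formulas, so the specific forms used here (raw count for tf, `1 + log(tf)` for ltf, `log(N / df)` for idf) are textbook assumptions, and the category factor (cat) is left out because its definition is not given. Function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def idf_table(docs):
    """Document set factor: idf(t) = log(N / df(t)), a common textbook form."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return {t: math.log(n / c) for t, c in df.items()}


def weight(doc, idf, log_tf=False):
    """Document factor times document set factor: tf*idf, or ltf*idf when log_tf is True."""
    out = {}
    for term, count in Counter(doc).items():
        local = 1.0 + math.log(count) if log_tf else float(count)
        out[term] = local * idf.get(term, 0.0)  # unseen terms get weight 0
    return out


# Hypothetical toy collection.
docs = [
    ["rocchio", "rocchio", "classifier"],
    ["classifier", "svm"],
    ["classifier", "index"],
]
idf = idf_table(docs)
tfidf = weight(docs[0], idf)
ltfidf = weight(docs[0], idf, log_tf=True)
```

A term that occurs in every document, such as "classifier" here, gets an idf of zero and so contributes nothing under either scheme.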

An Analytical Study on Performance Factors of Automatic Classification based on Machine Learning (기계학습에 기초한 자동분류의 성능 요소에 관한 연구)

  • Kim, Pan Jun
    • Journal of the Korean Society for Information Management / v.33 no.2 / pp.33-59 / 2016
  • This study examined the factors affecting the performance of machine-learning-based automatic classification of domestic conference papers. In particular, with respect to the performance of automatically assigning class labels to papers in the Proceedings of the Conference of Korean Society for Information Management using the Rocchio algorithm, I investigated the characteristics of the key factors (classifier formation methods, training set size, weighting schemes, label assigning methods) through a range of experiments. The results show that it is more effective to apply parameters ($\beta$, $\lambda$) and a training set size (more than 5 years) suited to the classification environment and the properties of the document set. I also found that, when performance is equivalent, using simpler methods (single weighting schemes) is more efficient. Furthermore, because the classification of domestic papers is a multi-label classification task that assigns more than one label to an article, an optimal classification model needs to be developed based on the characteristics of the key factors with this environment in mind.

Optimization of Associative Word Knowledge Base using Apriori-Genetic Algorithm (연역적 유전자 알고리즘을 이용한 연관 단어 지식베이스의 최적화)

  • Go, Su-Jeong;Choe, Jun-Hyeok;Lee, Jeong-Hyeon
    • Journal of KIISE: Software and Applications / v.28 no.8 / pp.560-569 / 2001
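  • Query expansion in knowledge-based information retrieval systems requires a knowledge base that takes the semantic relations between words into account. Existing simple mining techniques extract associative words without considering user preferences, so recall improves but precision degrades. This paper proposes a method for building an optimized associative word knowledge base with improved precision, one that contains only the associative words the user prefers among those that reflect the semantic relations between words. To this end, web documents in the computer field are classified into 8 classes, and nouns are extracted from the web documents of each class. Associative words are extracted from these nouns using the Apriori algorithm, and a genetic algorithm is used to exclude the associative words that the user does not prefer from the knowledge base. To evaluate the performance of the proposed Apriori and genetic algorithms, the Apriori algorithm is compared with mutual information and the Rocchio algorithm, and the genetic algorithm is compared with a word refinement method using TF.IDF.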
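
The associative word extraction step described above can be sketched with a textbook Apriori frequent-itemset search, where each document's set of extracted nouns is treated as one transaction. This is only an illustration under assumptions: the function name, the `min_support` value, and the toy noun sets are hypothetical, and the paper's genetic-algorithm filtering step is not shown.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset whose support is at least min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        scored = {c: support(c) for c in current}
        level = {c: s for c, s in scored.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical noun sets extracted from four web documents.
noun_sets = [
    {"network", "protocol", "router"},
    {"network", "protocol"},
    {"network", "router"},
    {"database", "query"},
]
associations = apriori(noun_sets, min_support=0.5)
```

Itemsets of size two or more in the result are the candidate associative words; in the setting the abstract describes, a later step would discard those the user does not prefer.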

Evaluation of User Profile Construction Method by Fuzzy Inference

  • Kim, Byeong-Man;Rho, Sun-Ok;Oh, Sang-Yeop;Lee, Hyun-Ah;Kim, Jong-Wan
    • International Journal of Fuzzy Logic and Intelligent Systems / v.8 no.3 / pp.175-184 / 2008
  • To construct user profiles automatically, a method for extracting representative keywords from a set of documents is needed. In our previous work, we suggested such a method and showed its usefulness. Here, we apply it to the classification problem and observe how much it contributes to performance improvement. The method can be used as a linear document classifier with few modifications, so we first evaluate its performance in that case. The method is also applicable to some non-linear classification methods such as GIS (Generalized Instance Set). In the GIS algorithm, generalized instances are built from training documents by a generalization function and then the K-NN algorithm is applied to them; our method can serve as the generalization function. For comparison, two well-known linear classification methods, the Rocchio and Widrow-Hoff algorithms, are also used. Experimental results show that our method is better than the others when only positive documents are considered, but not when negative documents are also considered.

An Analytical Study on Automatic Classification of Domestic Journal articles Based on Machine Learning (기계학습에 기초한 국내 학술지 논문의 자동분류에 관한 연구)

  • Kim, Pan Jun
    • Journal of the Korean Society for Information Management / v.35 no.2 / pp.37-62 / 2018
  • This study examined the factors affecting the performance of machine-learning-based automatic classification of domestic journal articles in the field of LIS. In particular, with respect to the performance of automatically assigning class labels to articles in the "Journal of the Korean Society for Information Management", I investigated the characteristics of the key factors (weighting schemes, training set size, classification algorithms, label assigning methods) through a range of experiments. The results show that it is effective to apply each element appropriately according to the classification environment and the characteristics of the document set, and that fairly good performance can be obtained with a simpler model. In addition, the classification of domestic journal articles can be regarded as a multi-label classification task that assigns more than one category to a given article. With this environment in mind, I therefore proposed an optimal classification model that uses a simple and fast classification algorithm and a small training set.