Performance Improvement of Web Document Classification through Incorporation of Feature Selection and Weighting

Lee, Ah-Ram; Kim, Han-Joon; Man, Xuan

doi:10.7236/JIIBC.2013.13.4.141

The Journal of the Institute of Internet, Broadcasting and Communication (한국인터넷방송통신학회논문지)

Volume 13 Issue 4 / Pages 141-148 / 2013 / 2289-0238 (pISSN) / 2289-0246 (eISSN)

The Institute of Internet, Broadcasting and Communication (한국인터넷방송통신학회)

특징선택과 특징가중의 융합을 통한 웹문서분류 성능의 개선

Lee, Ah-Ram (School of Electrical and Computer Engineering, University of Seoul) ;
Kim, Han-Joon (School of Electrical and Computer Engineering, University of Seoul) ;
Man, Xuan (School of Electrical and Computer Engineering, University of Seoul)

이아람 (서울시립대학교 전자전기컴퓨터공학부) ;
김한준 (서울시립대학교 전자전기컴퓨터공학부) ;
현만 (서울시립대학교 전자전기컴퓨터공학부)

Received : 2013.07.09
Accepted : 2013.08.16
Published : 2013.08.31

https://doi.org/10.7236/JIIBC.2013.13.4.141

Abstract

Automated classification systems that use machine learning build classification models through a learning process and then classify unknown data into a predefined set of categories according to those models. The performance of machine learning-based classification systems depends heavily on the quality of the features that make up the classification models. For textual data, the feature set can be generated from word terms and structural information. In particular, extracting features from Web documents requires analyzing tag and hyperlink information. Recent studies on Web document classification focus on feature engineering rather than on the machine learning algorithms themselves. This paper therefore proposes a novel method that combines feature selection and feature weighting to improve classification models effectively. Extensive experiments on the Web-KB document collection show that the proposed method outperforms conventional ones.

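An automated classification system based on machine learning builds a classification model through a learning process and uses it to assign unclassified data to specific categories. The performance of such a system depends heavily on the quality of the features that make up the classification model. For document data, the words occurring in a document and the document's structural information are used to generate the feature set. In particular, to extract features from Web documents, tag and hyperlink information can be analyzed in addition to words. Recent research on Web document classification focuses on feature generation and processing techniques rather than on machine learning algorithms. This paper therefore proposes a technique that selectively extracts high-quality features from word, tag, and hyperlink information and automatically assigns weights to them, in order to improve classification models for Web documents. Various experiments on the Web-KB document collection demonstrate the superiority of the proposed technique.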
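
The pipeline outlined in the abstract (features drawn from words, tags, and hyperlinks, then selected and weighted) can be illustrated with a minimal sketch. The abstract does not state which selection or weighting measures the paper uses, so chi-square selection and TF-IDF weighting are stand-ins chosen here as assumptions. The names `FeatureExtractor`, `extract`, `chi_square`, `select_features`, and `tfidf`, the set of tags treated as informative, and the sample pages are all hypothetical.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class FeatureExtractor(HTMLParser):
    """Collects word features prefixed by the HTML context they occur in."""

    TAGGED = {"title", "h1", "a"}  # assumption: tags treated as informative

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags of interest
        self.features = []   # e.g. "title:course", "link:hw1.html"

    def handle_starttag(self, tag, attrs):
        if tag in self.TAGGED:
            self.stack.append(tag)
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.features.append("link:" + value.lower())

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        prefix = self.stack[-1] if self.stack else "body"
        for word in data.lower().split():
            word = word.strip(".,;:!?()\"'")
            if word:
                self.features.append(f"{prefix}:{word}")


def extract(html):
    """Return a Counter of context-prefixed features for one page."""
    parser = FeatureExtractor()
    parser.feed(html)
    parser.close()
    return Counter(parser.features)


def chi_square(docs, labels, feature, target):
    """Chi-square statistic between feature presence and one class."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        present = feature in doc
        if present and label == target:
            a += 1
        elif present:
            b += 1
        elif label == target:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def select_features(docs, labels, k):
    """Keep the k features with the highest chi-square over any class."""
    vocab = set().union(*docs)
    classes = set(labels)
    score = {f: max(chi_square(docs, labels, f, c) for c in classes)
             for f in vocab}
    return sorted(vocab, key=lambda f: (-score[f], f))[:k]


def tfidf(doc, docs, selected):
    """TF-IDF weights for the selected features that occur in doc."""
    n = len(docs)
    weights = {}
    for f in selected:
        if f in doc:
            df = sum(1 for other in docs if f in other)
            weights[f] = doc[f] * math.log(n / df)
    return weights


# Hypothetical toy pages, loosely modeled on university Web page categories.
pages = [
    ('<title>Course syllabus</title><a href="hw1.html">homework</a>', "course"),
    ('<title>Course schedule</title><p>homework due</p>', "course"),
    ('<title>Faculty profile</title><a href="pubs.html">publications</a>', "faculty"),
    ('<title>Faculty research</title><p>publications list</p>', "faculty"),
]
docs = [extract(html) for html, _ in pages]
labels = [label for _, label in pages]
selected = select_features(docs, labels, k=2)
print(selected)
print(tfidf(docs[0], docs, selected))
```

Prefixing each word with its tag context keeps "homework" in anchor text distinct from "homework" in body text, so selection and weighting can treat the two differently. The resulting weight dictionaries could then be passed to any standard classifier.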
