Web Structure Mining by Extracting Hyperlinks from Web Documents and Access Logs

Lee, Seong-Dae; Park, Hyu-Chan

doi:10.6109/jkiice.2007.11.11.2059

Journal of the Korea Institute of Information and Communication Engineering (한국정보통신학회논문지)

Volume 11, Issue 11, Pages 2059-2071, 2007
pISSN 2234-4772, eISSN 2288-4165

The Korea Institute of Information and Communication Engineering (한국정보통신학회)


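Lee, Seong-Dae (Industry-Academic Cooperation Foundation, Korea Maritime University);
Park, Hyu-Chan (Division of Computer, Control and Electronic Communication Engineering, Korea Maritime University)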

Published : 2007.11.30


Abstract

If the exact structure of a Web site is known, the information provider can discover users' behavior patterns and characteristics and offer better services, and users can find useful information easily and accurately. Extracting the exact structure is difficult, however, because documents on the Web tend to change frequently. This paper proposes a new method for extracting the structure of a Web site automatically. The method consists of two phases. The first phase extracts the hyperlinks among Web documents and constructs a directed graph that represents the structure of the Web site. This phase cannot, however, discover hyperlinks embedded in Flash or Java applets. The second phase finds such hidden hyperlinks by using the Web access log: it first extracts each user's click stream from the access log, and then extracts the hidden hyperlinks by comparing the click streams with the directed graph. Several experiments were conducted to evaluate the proposed method, and they confirmed that it extracts the structure of a Web site more accurately.
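The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the in-memory `pages` dictionary, the pre-parsed `(user, unix_time, path)` log tuples, the 30-minute session timeout, and the `min_count` threshold are all assumptions made for illustration. In particular, treating every pair of consecutive requests in a click stream as a candidate hyperlink is a simplification, since users also navigate with bookmarks, typed URLs, and the back button.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags in one HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def build_link_graph(pages):
    """Phase 1 (sketch): pages maps a site-relative path to its HTML source."""
    graph = defaultdict(set)
    for path, html in pages.items():
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            target = urlparse(urljoin(path, href))
            # Keep only links to other known pages of the same site.
            if not target.netloc and target.path in pages and target.path != path:
                graph[path].add(target.path)
    return graph


def click_streams(log_entries, timeout=1800):
    """Group (user, unix_time, path) entries into per-user, time-ordered sessions."""
    by_user = defaultdict(list)
    for user, ts, path in log_entries:
        by_user[user].append((ts, path))
    streams = []
    for visits in by_user.values():
        visits.sort()
        current = [visits[0][1]]
        for (prev_ts, _), (ts, path) in zip(visits, visits[1:]):
            if ts - prev_ts > timeout:  # assumed session boundary
                streams.append(current)
                current = []
            current.append(path)
        streams.append(current)
    return streams


def hidden_links(graph, streams, min_count=1):
    """Phase 2 (sketch): transitions seen in click streams but absent from the graph."""
    counts = defaultdict(int)
    for stream in streams:
        for src, dst in zip(stream, stream[1:]):
            if src != dst and dst not in graph.get(src, ()):
                counts[(src, dst)] += 1
    return {edge for edge, n in counts.items() if n >= min_count}


# Hypothetical site: the link to /products.html lives only inside a Flash menu.
pages = {
    "/index.html": '<a href="about.html">About</a><object data="menu.swf"></object>',
    "/about.html": '<a href="/index.html">Home</a>',
    "/products.html": '<a href="/index.html">Home</a>',
}
log = [
    ("u1", 0, "/index.html"), ("u1", 10, "/products.html"), ("u1", 20, "/index.html"),
    ("u2", 5, "/index.html"), ("u2", 9, "/about.html"),
]
graph = build_link_graph(pages)
hidden = hidden_links(graph, click_streams(log))
print(hidden)  # {('/index.html', '/products.html')}
```

Raising `min_count` would be one way to filter out transitions that appear only once and are more likely noise than real hidden hyperlinks; the abstract does not say whether the proposed method applies such a threshold.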

