A Focused Crawler by Segmentation of Context Information

Cho, Chang-Hee; Lee, Nam-Yong; Kang, Jin-Bum; Yang, Jae-Young; Choi, Joong-Min

doi:10.3745/KIPSTB.2005.12B.6.697

The KIPS Transactions:PartB (정보처리학회논문지B)

Volume 12B, Issue 6 (Serial No. 102) / pp. 697-702 / 2005 / pISSN 1598-284X

Korea Information Processing Society (한국정보처리학회)

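Korean title: 주변정보 분할을 이용한 주제 중심 웹 문서 수집기 ("A Topic-Focused Web Document Collector Using Segmentation of Surrounding Information")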

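Cho, Chang-Hee (Legal Informatization, Ministry of Government Legislation);
Lee, Nam-Yong (Department of Computer Science, Soongsil University);
Kang, Jin-Bum (Graduate School, Hanyang University);
Yang, Jae-Young (Consulting Business Team, Dongbu Information Technology Co., Ltd.);
Choi, Joong-Min (Department of Computer Science and Engineering, Hanyang University)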

Published: 2005.10.01

https://doi.org/10.3745/KIPSTB.2005.12B.6.697

Abstract

A focused crawler is a topic-driven document-collecting crawler that has been suggested as a promising alternative for maintaining up-to-date web document indices in search engines. A major problem inherent in previous focused crawlers is that they are liable to miss highly relevant documents that are linked from off-topic documents. This problem mainly originates from a lack of consideration of the structural information in a document. Traditional weighting methods such as TF-IDF, as employed in document classification, can lead to this problem. To improve the performance of focused crawlers, this paper proposes a locality-based document segmentation scheme for determining the relevance of a document to a specific topic. A document is segmented into a set of sub-documents using contextual features around its hyperlinks. This information is then used to decide whether the crawler should fetch the documents linked from hyperlinks in an off-topic document.

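Focused web crawlers are emerging as an alternative way of keeping the web document indices of search engines up to date. However, focused crawlers have the problem that they cannot collect relevant documents that are linked from off-topic documents. This problem arises because the structural characteristics of a document are not taken into account. In particular, document analysis based on term frequency and inverse document frequency is a major cause of this problem. To improve the performance of focused crawlers, this paper proposes a document segmentation method based on local information. A document is divided into several sub-documents using feature information that reflects the context around each hyperlink. The proposed focused crawler uses these sub-documents to judge whether a hyperlink points to a relevant document, and on that basis decides whether to collect the document.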
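The idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the fixed word window around each anchor, the raw term-frequency cosine score (used here as a simple stand-in for TF-IDF style weighting), and the names `LinkContextParser`, `score_links`, and `SAMPLE_HTML` are all assumptions made for illustration. The sketch scores a page as a whole and then scores the text surrounding each hyperlink separately, so that a relevant link on a mostly off-topic page can still stand out.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collects visible words in order and records where each anchor text sits."""

    def __init__(self):
        super().__init__()
        self.words = []    # all words, in document order
        self.links = []    # (href, start_index, end_index) into self.words
        self._open = None  # (href, start_index) of the <a> currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, len(self.words))

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.links.append((href, start, len(self.words)))
            self._open = None

    def handle_data(self, data):
        self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_links(html, topic_terms, window=5):
    """Return (whole-page score, {href: score of the segment around that link}).

    Assumption: a segment is the anchor text plus `window` words on each side.
    """
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    topic = Counter(topic_terms)
    page_score = cosine(Counter(parser.words), topic)
    link_scores = {}
    for href, start, end in parser.links:
        segment = parser.words[max(0, start - window): end + window]
        link_scores[href] = cosine(Counter(segment), topic)
    return page_score, link_scores


# Hypothetical, mostly off-topic page that contains one on-topic link.
SAMPLE_HTML = """
<p>Our favourite pasta recipes use fresh tomato sauce, basil, garlic and olive oil.
<a href="/pasta">More pasta recipes</a> are added every week for home cooks.</p>
<p>Unrelated note: <a href="/crawler">a focused web crawler tutorial</a>
for search engine builders.</p>
"""

page_score, link_scores = score_links(SAMPLE_HTML, ["web", "crawler", "search", "engine"])
print("page:", round(page_score, 3))
for href, score in link_scores.items():
    print(href, round(score, 3))
```

On this sample the segment around `/crawler` scores higher than the page as a whole, while the segment around `/pasta` scores zero. A crawler that thresholds only the whole-page score might skip both links, whereas per-segment scores let it follow the relevant one. The sketch does not filter out script or style text, and a real segmentation would more plausibly use the document's structure than a fixed word window.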
