Estimating Coverage of the Web Search Services Using Near-Uniform Sampling of Web Documents

Jang, Sung-Soo; Kim, Kwang-Hyun; Lee, Joon-Ho

doi:10.3745/KIPSTD.2008.15-D.3.305

The KIPS Transactions: Part D (정보처리학회논문지D)

Volume 15D, Issue 3, pp. 305-312, 2008, pISSN 1598-2866

Korea Information Processing Society (한국정보처리학회)



Korean title: 균등한 웹 문서 샘플링을 이용한 웹 검색 서비스들의 커버리지 측정

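Jang, Sung-Soo (Department of Computer Science, Graduate School, Soongsil University);
Kim, Kwang-Hyun (NHN (Search Solution));
Lee, Joon-Ho (School of Computing, Soongsil University)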

Published : 2008.06.30

https://doi.org/10.3745/KIPSTD.2008.15-D.3.305


Abstract

Web documents containing useful information are published across the internet and are reached through web search services. Search services therefore try to collect as many web documents as possible, but they have difficulty determining how much of the web their collections cover. This paper analyzes existing coverage measurement methods and proposes a more effective one: web documents are sampled near-uniformly from the internet, and each sampled document is checked against each web search service to see whether it is indexed, which yields estimates of both the absolute and the relative coverage of the services. The proposed method is applied to web search services in Korea; Google showed the highest absolute and relative coverage, followed by Naver and Empas. These results are expected to help in estimating the coverage of web search services.
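The estimation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a near-uniform sample of URLs is already available and that, for each service, the set of sampled URLs found in its index is known. The function name `estimate_coverage`, the service names, and the toy data are all hypothetical. Absolute coverage is taken here as the fraction of the sample a service indexes. Relative coverage is assumed to be measured against the sampled documents indexed by at least one service; the abstract does not state the paper's exact definition.

```python
from typing import Dict, List, Set


def estimate_coverage(
    sample: List[str], indexed: Dict[str, Set[str]]
) -> Dict[str, Dict[str, float]]:
    """Estimate per-service coverage from a sample of URLs.

    sample:  URLs assumed to be drawn near-uniformly from the web.
    indexed: for each service, the URLs it was observed to index.
    """
    sample_set = set(sample)
    # Only hits that fall inside the sample count toward the estimate.
    hits = {name: urls & sample_set for name, urls in indexed.items()}
    # Sampled documents indexed by at least one service (assumed
    # denominator for relative coverage).
    union: Set[str] = set()
    for found in hits.values():
        union |= found

    result: Dict[str, Dict[str, float]] = {}
    for name, found in hits.items():
        absolute = len(found) / len(sample_set) if sample_set else 0.0
        relative = len(found) / len(union) if union else 0.0
        result[name] = {"absolute": absolute, "relative": relative}
    return result


# Toy data: 10 sampled URLs and two hypothetical services.
sample = [f"http://example.com/{i}" for i in range(10)]
indexed = {
    "service_a": {f"http://example.com/{i}" for i in range(0, 6)},
    "service_b": {f"http://example.com/{i}" for i in range(4, 8)},
}
coverage = estimate_coverage(sample, indexed)
for name, scores in coverage.items():
    print(name, scores)
```

In the toy data, service_a indexes 6 of the 10 sampled URLs and service_b indexes 4, and 8 sampled URLs are indexed by at least one of them. The quality of such estimates depends on how close to uniform the sample is, which is the part of the method this sketch leaves out.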


