URL Signatures for Improving URL Normalization

Soon, Lay-Ki; Lee, Sang-Ho

Journal of KIISE: Databases

Volume 36, Issue 2 / pp. 139-149 / 2009 / pISSN 1229-7739

Korean Institute of Information Scientists and Engineers

Soon, Lay-Ki (School of Computing, Soongsil University);
Lee, Sang-Ho (School of Computing, Soongsil University)

Published: 2009.04.15

Abstract

In the standard URL normalization mechanism, URLs are normalized syntactically by a set of predefined steps. In this paper, we propose complementing standard URL normalization with semantically meaningful metadata of the web pages. The metadata considered are the body text and the page size of each web page, both of which can be extracted during HTML parsing. The results of our first, exploratory experiment indicate that body text is effective in identifying equivalent URLs. Hence, in the second experiment, given a URL that has undergone standard normalization, we construct its URL signature by hashing the body text of the associated web page with the Message-Digest algorithm 5 (MD5). URLs that share an identical signature are considered equivalent in our scheme. The results of the second experiment show that the proposed URL signatures reduced redundant URLs by a further 32.94% compared with standard URL normalization.
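The signature idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`BodyTextExtractor`, `body_text`, `url_signature`, `group_equivalent`), the whitespace collapsing, the skipping of script and style content, and the example URLs are all assumptions made for illustration. The sketch assumes each URL has already been through standard normalization and that its page HTML has been fetched.

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text inside <body>, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            self.chunks.append(data)


def body_text(html):
    """Return the body text of a page with whitespace collapsed (an assumption)."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def url_signature(html):
    """Hash the body text with MD5; the hex digest serves as the URL signature."""
    return hashlib.md5(body_text(html).encode("utf-8")).hexdigest()


def group_equivalent(pages):
    """Group URLs whose pages share a signature. `pages` maps URL -> HTML."""
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[url_signature(html)].append(url)
    return list(groups.values())


# Hypothetical pages: the first two differ only in <head> and whitespace.
pages = {
    "http://example.com/": "<html><body><p>Hello  world</p></body></html>",
    "http://example.com/index.html":
        "<html><head><title>x</title></head><body><p>Hello world</p></body></html>",
    "http://example.com/other": "<html><body>Other</body></html>",
}
groups = group_equivalent(pages)
```

In this sketch the first two URLs land in the same group because their body texts hash to the same digest, while the third stays separate. A crawler could then keep one representative URL per group.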
