Website Classification based on Occurrence Frequency of Medical Terms and Hyperlinks in Webpage

  • Lee, In Keun (Department of Medical Informatics, Kyungpook National University)
  • Kim, Hwa Sun (Department of Medical Information Technology, Daegu Haany University)
  • Cho, Hune (Department of Medical Informatics, Kyungpook National University)
  • Received : 2012.11.30
  • Accepted : 2013.02.08
  • Published : 2013.04.25

Abstract

This paper proposes a method for classifying internet websites based on the occurrence frequency of medical terms in their webpages and on the website structure formed by the webpages and the hyperlinks between them. The medical-field suitability of each webpage is measured from two factors: (1) the occurrence frequency of medical terms among all terms in the webpage, and (2) the occurrence frequency of medical terms among the de-duplicated terms in the webpage. The suitability of the website as a whole is then computed from the suitability of all its webpages, using (3) the number of hyperlinks that must be followed to reach each webpage from the homepage. We evaluated the method on 80 medical websites and 127 nonmedical websites registered in the directory search services of internet portal sites, and obtained a classification accuracy of 82.5%.
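The abstract names the three factors but does not give the formula that combines them. The sketch below is one possible reading, not the authors' definition: it assumes the page score is the plain mean of the two term ratios, and that each page's contribution to the site score is weighted by 1/(depth + 1), where depth is the minimum number of hyperlinks from the homepage. The function names, the weighting scheme, and the sample data are all hypothetical.

```python
from collections import deque


def page_suitability(terms, medical_terms):
    """Mean of factor (1) and factor (2); equal weighting is an assumption."""
    if not terms:
        return 0.0
    # Factor (1): share of medical terms among all terms, duplicates included.
    total_ratio = sum(t in medical_terms for t in terms) / len(terms)
    # Factor (2): share of medical terms among de-duplicated terms.
    unique = set(terms)
    unique_ratio = len(unique & medical_terms) / len(unique)
    return (total_ratio + unique_ratio) / 2


def link_depths(links, homepage):
    """Factor (3): minimum hyperlink count from the homepage, via breadth-first search."""
    depths = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def site_suitability(pages, links, homepage, medical_terms):
    """Depth-weighted average of page scores; the 1/(depth + 1) weight is an assumption."""
    depths = link_depths(links, homepage)
    weighted = total_weight = 0.0
    for page, depth in depths.items():
        weight = 1.0 / (depth + 1)
        weighted += weight * page_suitability(pages.get(page, []), medical_terms)
        total_weight += weight
    return weighted / total_weight


# Hypothetical two-page site with a tiny illustrative term dictionary.
medical_terms = {"diabetes", "insulin", "hypertension"}
pages = {
    "/": ["diabetes", "clinic", "diabetes", "insulin"],
    "/about": ["clinic", "staff"],
}
links = {"/": ["/about"]}
score = site_suitability(pages, links, "/", medical_terms)
```

A site would then be labeled medical when `score` exceeds some threshold; the abstract does not state the threshold, so it would have to be chosen on held-out data.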
