Classification of Human Papillomavirus (HPV) Risk Type via Text Mining

  • Park, Seong-Bae (Biointelligence Lab., School of Computer Science and Engineering, Seoul National University) ;
  • Hwang, Sohyun (Biointelligence Lab., School of Computer Science and Engineering, Seoul National University) ;
  • Zhang, Byoung-Tak (Biointelligence Lab., School of Computer Science and Engineering, Seoul National University)
  • Published : 2003.12.01

Abstract

Human papillomavirus (HPV) infection is known to be the main cause of cervical cancer, which is a leading cause of cancer deaths in women worldwide. Because there are more than 100 types of HPV, it is critical to discriminate the types associated with cervical cancer from those that are not. In this paper, we classify the risk type of HPVs using their textual descriptions. The important issue in this problem is that false negatives and false positives do not carry the same cost. That is, we must find as many high-risk HPVs as possible, even at the expense of misclassifying some low-risk HPVs. For this purpose, AdaCost, a cost-sensitive learner, is adopted to account for the different misclassification costs of the training examples. Experimental results on the HPV sequence database show that taking costs into account yields higher performance. The improvement in F-score is larger than the improvement in accuracy, which implies that more high-risk HPVs are found.
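The cost-sensitive idea can be illustrated with a minimal sketch of one AdaCost-style reweighting round. This is not the authors' implementation. It assumes the update rule as usually stated for AdaCost (Fan et al., 1999): a cost adjustment of 0.5 - 0.5c for correctly classified examples and 0.5 + 0.5c for misclassified ones, with initial weights proportional to cost. The function name `adacost_reweight`, the toy labels, the cost values 0.6 and 0.2, and the predictions are all hypothetical and chosen only for illustration.

```python
import math


def adacost_reweight(weights, labels, preds, costs):
    """One AdaCost-style reweighting round (illustrative sketch).

    labels and preds take values in {-1, +1}; costs lie in [0, 1].
    Returns the hypothesis weight alpha and the normalized new weights.
    """
    # Cost adjustment (assumed form): small for correct predictions,
    # large for mistakes, and growing with the example's cost.
    betas = [0.5 - 0.5 * c if y == h else 0.5 + 0.5 * c
             for y, h, c in zip(labels, preds, costs)]
    # Cost-adjusted margin of the weak hypothesis under the current weights.
    r = sum(w * y * h * b
            for w, y, h, b in zip(weights, labels, preds, betas))
    alpha = 0.5 * math.log((1 + r) / (1 - r))
    # Exponential update followed by normalization.
    raw = [w * math.exp(-alpha * y * h * b)
           for w, y, h, b in zip(weights, labels, preds, betas)]
    total = sum(raw)
    return alpha, [v / total for v in raw]


# Hypothetical toy data: +1 = high-risk, -1 = low-risk.
labels = [+1] * 5 + [-1] * 5
# Assumed costs: missing a high-risk type is costlier than a false alarm.
costs = [0.6] * 5 + [0.2] * 5
# A weak hypothesis that misses one high-risk type (index 4)
# and raises one false alarm on a low-risk type (index 9).
preds = [+1, +1, +1, +1, -1, -1, -1, -1, -1, +1]

# Initial weights are proportional to the costs.
total_cost = sum(costs)
weights = [c / total_cost for c in costs]

alpha, new_weights = adacost_reweight(weights, labels, preds, costs)
growth_missed_high = new_weights[4] / weights[4]
growth_false_alarm = new_weights[9] / weights[9]
print(f"alpha = {alpha:.4f}")
print(f"weight growth, missed high-risk: {growth_missed_high:.4f}")
print(f"weight growth, false alarm:      {growth_false_alarm:.4f}")
```

In this toy setting the weight of the missed high-risk example grows faster than that of the false alarm, so the next weak learner is pushed harder toward recovering high-risk types. That is the behavior one would expect to raise recall, and therefore F-score, more than accuracy.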
